SRI VIDYA COLLEGE OF ENGINEERING AND TECHNOLOGY QUESTION BANK
CS6660 COMPILER DESIGN UNIT-2
UNIT-II LEXICAL ANALYSIS
2 MARKS
1. What is Lexical Analysis?
The first phase of a compiler is lexical analysis. It is also known as linear analysis, in which the stream of characters making up the source program is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning.
2. What is a lexeme? Define a regular set.
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
A language denoted by a regular expression is said to be a regular set.
3. What is a sentinel? What is its usage?
A sentinel is a special character that cannot be part of the source program. Normally we use 'eof' as the sentinel. It is used to speed up the lexical analyzer.
4. What is a regular expression? State the rules, which define regular expression?
A regular expression is a notation for describing a regular language.
Rules:
1) ε is a regular expression that denotes {ε}, the set containing only the empty string.
2) If a is a symbol in the alphabet Σ, then a is a regular expression that denotes {a}.
3) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting L(r) U L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
5. What are the Error-recovery actions in a lexical analyzer?
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters
6. Construct Regular expression for the language
L = {w ∈ {a, b}* | w ends in abb}
Ans: (a|b)*abb
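The answer can be checked mechanically; a minimal sketch in Python's re module (whose syntax happens to match the textbook notation here; fullmatch anchors the whole string):

```python
import re

# Regular expression for strings over {a, b} that end in abb.
PATTERN = re.compile(r"(a|b)*abb")

def ends_in_abb(w: str) -> bool:
    """True iff w belongs to the language (a|b)*abb."""
    return PATTERN.fullmatch(w) is not None
```

For example, ends_in_abb("aabb") is true, while ends_in_abb("abba") is false.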
7. What is recognizer?
Recognizers are machines that accept the strings belonging to a certain language. If every valid string of a language is accepted by a machine, the language is said to be accepted by that machine; any other string is rejected.
STUDENTSFOCUS.COM
8. Differentiate compiler and interpreter.
Compiler produces a target program whereas an interpreter performs the
operations implied by the source program.
9. Write short notes on buffer pair.
A buffer pair is a specialized buffering technique, concerned with efficiency, used to reduce the overhead required to process an input character when lookahead on the input is needed. The buffer is divided into two N-character halves, and two pointers are maintained. It is used when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match is announced.
10. Differentiate tokens, patterns, lexeme.
Tokens- Sequence of characters that have a collective meaning.
Patterns- There is a set of strings in the input for which the same token is produced as
output. This set of strings is described by a rule called a pattern associated with the
token
Lexeme- A sequence of characters in the source program that is matched by the
pattern for a token.
11. List the operations on languages.
Union - L U M = {s | s is in L or s is in M}
Concatenation - LM = {st | s is in L and t is in M}
Kleene Closure - L* (zero or more concatenations of L)
Positive Closure - L+ (one or more concatenations of L)
12. Write a regular expression for an identifier.
An identifier is defined as a letter followed by zero or more letters or digits. The regular expression for an identifier is given as letter (letter | digit)*.
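As a quick check, the same pattern can be tried in Python (a sketch; letter is taken here to be [A-Za-z] and digit [0-9], which the definition above leaves unspecified):

```python
import re

# letter (letter | digit)*
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(s: str) -> bool:
    """True iff s is a letter followed by zero or more letters or digits."""
    return IDENT.fullmatch(s) is not None
```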
13. Mention the various notational shorthands for representing regular expressions.
One or more instances (+)
Zero or one instance (?)
Character classes ([abc], where a, b, c are alphabet symbols, denotes the regular expression a | b | c)
14. What is the function of a hierarchical analysis?
Hierarchical analysis is one in which the tokens are grouped hierarchically into nested
collections with collective meaning. Also termed as Parsing.
15. What does a semantic analysis do?
Semantic analysis is one in which certain checks are performed to ensure that
components of a program fit together meaningfully. Mainly performs type checking.
16 MARKS
1)What are roles and tasks of a lexical analyzer?
Main Task: Read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for the parser to use in syntax analysis.
Secondary Tasks:
Stripping out comments and white space (blank, newline, and tab characters) from the source program.
Correlating error messages from the compiler with the source program, for example by keeping track of the number of newlines seen.
Entering identifiers into the symbol table.
2. Converting a Regular Expression into a Deterministic Finite Automaton
The task of a scanner generator, such as JLex, is to generate the transition tables or to synthesize
the scanner program given a scanner specification (in the form of a set of REs). So it needs to
convert REs into a single DFA. This is accomplished in two steps: first it converts REs into a
non-deterministic finite automaton (NFA) and then it converts the NFA into a DFA.
An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε. In the case of multiple transitions from a state over the same character, when we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. An ε transition doesn't consume any input characters, so you may jump to another state for free.
Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power. The problem is that when converting an NFA to a DFA we may get an exponential blowup in the number of states.
We will first learn how to convert a RE into a NFA. This is the easy part. There are only 5 rules,
one for each type of RE:
As it can been shown inductively, the above rules construct NFAs with only one final state. For
example, the third rule indicates that, to construct the NFA for the RE AB, we construct the
NFAs for A and B, which are represented as two boxes with one start state and one final state for
each box. Then the NFA for AB is constructed by connecting the final state of A to the start state
of B using an empty transition.
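The concatenation rule can be sketched in Python. The NFA representation below (a triple of start state, final state, and a moves dictionary, with None standing for ε) is an assumption made for illustration, not a fixed textbook encoding:

```python
import itertools

counter = itertools.count()  # fresh state numbers

def symbol_nfa(a):
    """NFA for a single symbol a: start --a--> final."""
    s, f = next(counter), next(counter)
    return s, f, {(s, a): {f}}

def concat(nfa1, nfa2):
    """NFA for AB: connect A's final state to B's start state by an empty (ε) move."""
    s1, f1, m1 = nfa1
    s2, f2, m2 = nfa2
    moves = {**m1, **m2}
    moves.setdefault((f1, None), set()).add(s2)  # the ε transition
    return s1, f2, moves
```

For example, concat(symbol_nfa('a'), symbol_nfa('b')) yields an NFA for the RE ab with one start state and one final state.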
For example, the RE (a|b)c is mapped to the following NFA:
The next step is to convert an NFA to a DFA (a process called subset construction). Suppose that you assign a number to each NFA state. The DFA states generated by subset construction have sets of numbers, instead of just one number. For example, a DFA state may have been assigned the set {5, 6, 8}. This indicates that arriving at the state labeled {5, 6, 8} in the DFA is the same as arriving at the state 5, the state 6, or the state 8 in the NFA when parsing the same input. (Recall that a particular input sequence, when parsed by a DFA, leads to a unique state, while when parsed by an NFA it may lead to multiple states.)
First we need to handle transitions that lead to other states for free (without consuming any input). These are the ε transitions. We define the ε-closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more ε transitions. For example, the ε-closure of node 1 in the left figure below
is the set {1, 2}. The start state of the constructed DFA is labeled by the ε-closure of the NFA start state. For every DFA state labeled by some set {s1, ..., sn} and for every character c in the language alphabet, you find all the states reachable from s1, s2, ..., or sn using c arrows and you union together the ε-closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label. For example, node {1, 2} in the DFA above has an arrow to {3, 4, 5} for the character a, since the NFA node 3 can be reached from 1 on a and nodes 4 and 5 can be reached from 2. The b arrow for node {1, 2} goes to the error node, which is associated with the empty set of NFA nodes.
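The ε-closure computation and the subset construction can be sketched in Python (the moves-dictionary encoding, with None for ε, is an assumption for illustration):

```python
def eps_closure(states, moves):
    """All NFA states reachable from `states` using zero or more ε moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in moves.get((s, None), ()):   # follow ε arrows only
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, moves, alphabet):
    """Build the DFA transition map; each DFA state is a set of NFA states."""
    start_state = eps_closure({start}, moves)
    dfa, work = {}, [start_state]
    while work:
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for c in alphabet:
            step = set()
            for s in S:                      # states reachable on c arrows
                step |= moves.get((s, c), set())
            T = eps_closure(step, moves)     # union of their ε-closures
            dfa[S][c] = T
            work.append(T)
    return start_state, dfa
```

For a two-node NFA with an ε arrow from 1 to 2, eps_closure({1}, ...) gives {1, 2}, matching the worked example above.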
The following NFA recognizes (a|b)*(abb | a+b), even though it wasn't constructed with the above RE-to-NFA rules. It has the following DFA:
UNIT III SYNTAX ANALYSIS
1. Define parser.
Hierarchical analysis is one in which the tokens are grouped hierarchically into nested
collections with collective meaning.
Also termed as Parsing.
2. Mention the basic issues in parsing.
There are two important issues in parsing.
Specification of syntax
Representation of input after parsing.
3. Why lexical and syntax analyzers are separated out?
Reasons for separating the analysis phase into lexical and syntax analyzers:
Simpler design.
Compiler efficiency is improved.
Compiler portability is enhanced.
4. Define a context free grammar.
A context free grammar G is a collection of the following
V is a set of non terminals
T is a set of terminals
S is a start symbol
P is a set of production rules
G can be represented as G = (V,T,S,P)
Production rules are given in the following form
Non terminal → (V U T)*
5. Briefly explain the concept of derivation.
Derivation from S means generation of string w from S. For constructing derivation two
things are important.
i) Choice of non terminal from several others.
ii) Choice of rule from production rules for corresponding non terminal.
Instead of choosing an arbitrary non terminal one can choose:
i) leftmost derivation - the leftmost non terminal in a sentential form, or
ii) rightmost derivation - the rightmost non terminal in a sentential form.
6. Define ambiguous grammar.
A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of the language L(G), i.e. there is more than one leftmost derivation (equivalently, more than one rightmost derivation) for that sentence.
7. What is an operator precedence parser?
A grammar is said to be operator precedence if it possesses the following properties:
1. No production's right side is ε.
2. No production's right side contains two adjacent non terminals.
8. List the properties of LR parser.
1. LR parsers can be constructed to recognize most of the programming languages for
which the context free grammar can be written.
2. The class of grammar that can be parsed by LR parser is a superset of class of
grammars that can be parsed using predictive parsers.
3. LR parsers work using a non-backtracking shift-reduce technique, and are efficient.
9. Mention the types of LR parser.
SLR parser- simple LR parser
LALR parser- lookahead LR parser
Canonical LR parser
10. What are the problems with top down parsing?
The following are the problems associated with top down parsing:
Backtracking
Left recursion
Left factoring
Ambiguity
11. Write the algorithm for FIRST and FOLLOW.
FIRST
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non terminal and X → Y1 Y2 ... Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), ..., FIRST(Yi-1).
FOLLOW
1. Place $ in FOLLOW(S),where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except for ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A→ αBβ where FIRST(β) contains ε ,
then everything in FOLLOW(A) is in FOLLOW(B).
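The FIRST rules above amount to a fixed-point iteration; a minimal Python sketch (the grammar encoding, with 'eps' standing for ε, is assumed for illustration):

```python
def first_sets(grammar, terminals):
    """Compute FIRST for every symbol by iterating rules 1-3 to a fixed point.
    grammar maps each nonterminal to a list of productions (lists of symbols)."""
    FIRST = {t: {t} for t in terminals}          # rule 1: FIRST(a) = {a}
    FIRST.update({nt: set() for nt in grammar})
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                add = set()
                if prod == ['eps']:              # rule 2: X -> eps
                    add.add('eps')
                else:                            # rule 3: X -> Y1 Y2 ... Yk
                    for Y in prod:
                        add |= FIRST[Y] - {'eps'}
                        if 'eps' not in FIRST[Y]:
                            break
                    else:
                        add.add('eps')           # every Yi derives eps
                if not add <= FIRST[nt]:
                    FIRST[nt] |= add
                    changed = True
    return FIRST
```

On the usual expression grammar (E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id) this yields FIRST(E) = {(, id} and FIRST(E') = {+, ε}.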
12. List the advantages and disadvantages of operator precedence parsing.
Advantages
1. This type of parsing is simple to implement.
Disadvantages
1. An operator like minus has two different precedences (unary and binary). Hence it is hard to handle tokens like the minus sign.
2. This kind of parsing is applicable only to a small class of grammars.
13. What is dangling else problem?
The dangling else problem is the ambiguity that arises in a grammar such as the one below, where it is unclear which then an else should be matched with:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
The ambiguity can be eliminated by rewriting the grammar so that each else is matched with the closest unmatched then.
14. Write short notes on YACC.
YACC is an automatic tool for generating a parser program.
YACC stands for Yet Another Compiler-Compiler; it is a utility available on UNIX.
YACC is basically an LALR parser generator.
It can report conflicts or ambiguities in the form of error messages.
15. What is meant by handle pruning?
A rightmost derivation in reverse can be obtained by handle pruning.
If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-
sentential form of some as yet unknown rightmost derivation
S = γ0 => γ1…=> γn-1 => γn = w
16. Define LR(0) items.
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. Thus, production A → XYZ yields the four items
A→.XYZ
A→X.YZ
A→XY.Z
A→XYZ.
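Enumerating the items of a production is mechanical; a small Python sketch (the (head, body, dot-position) encoding is an assumption for illustration):

```python
def lr0_items(head, body):
    """All LR(0) items of head -> body: one item per dot position in the body."""
    return [(head, body, dot) for dot in range(len(body) + 1)]

def show(item):
    """Render an item with the dot written explicitly."""
    head, body, dot = item
    return head + " -> " + "".join(body[:dot]) + "." + "".join(body[dot:])
```

For the production A -> XYZ, lr0_items("A", ["X", "Y", "Z"]) produces the four items listed above.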
17. What is meant by viable prefixes?
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefixes. An equivalent definition of a viable prefix is that it is a
prefix of a right sentential form that does not continue past the right end of the rightmost
handle of that sentential form.
18. Define handle.
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the nonterminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
A handle of a right sentential form γ is a production A→β and a position of γ where the
string β may be found and replaced by A to produce the previous right-sentential form in a
rightmost derivation of γ. That is, if S =>* αAw => αβw, then A → β in the position following α is a handle of αβw.
19. What are kernel & non-kernel items?
Kernel items, which include the initial item, S' → .S, and all items whose dots are not at the left end.
Non-kernel items, which have their dots at the left end.
20. What is phrase level error recovery?
Phrase level error recovery is implemented by filling in the blank entries in the predictive
parsing table with pointers to error routines. These routines may change, insert, or delete
symbols on the input and issue appropriate error messages. They may also pop from the
stack.
PART B
1) Construct a predictive parsing table for the grammar
E -> E + T / T
T -> T * F / F
F -> (E) / id
2) Give the LALR parsing table for the grammar.
S -> L = R / R
L -> * R / id
R -> L
3) Consider the grammar
E -> TE'
E' -> +TE' / ε
T -> FT'
T' -> *FT' / ε
F -> (E) / id
Construct a predictive parsing table for the grammar shown above. Verify whether the
input string
id + id * id is accepted by the grammar or not.
4) Consider the grammar.
E -> E + T
E -> T
T -> T * F
T -> F
F -> (E) / id
Construct an LR parsing table for the above grammar. Give the moves of the LR parser
on
id * id + id
5) For the grammar given below, calculate the operator precedence relation and
the precedence functions
E -> E + E | E - E
| E * E | E / E
| E ^ E | (E)
| -E | id
6) Compare top down parsing and bottom up parsing methods.
7) What are LR parsers? Explain with a diagram the LR parsing algorithm.
8) What are parser generators?
9) Explain recursive descent parser with appropriate examples.
10) Compare SLR, LALR and LR parsers.
UNIT IV - SYNTAX DIRECTED TRANSLATION & RUN TIME
ENVIRONMENT
1. List the different storage allocation strategies.
The strategies are:
Static allocation
Stack allocation
Heap allocation
2. What are the contents of activation record?
The activation record is a block of memory used for managing the information needed by a single execution of a procedure. The various fields of an activation record are:
Temporary variables
Local variables
Saved machine registers
Control link
Access link
Actual parameters
Return values
3. What is dynamic scoping?
In dynamic scoping, a use of a non-local variable refers to the non-local data declared in the most recently called and still active procedure. Therefore, each time, new bindings are set up for the local names of the called procedure. With dynamic scoping, symbol tables may be required at run time.
4. Define symbol table.
Symbol table is a data structure used by the compiler to keep track of semantics of the
variables. It stores information about scope and binding information about names.
5. What is code motion?
Code motion is an optimization technique in which amount of code in a loop is
decreased. This transformation is applicable to the expression that yields the same
result independent of the number of times the loop is executed. Such an expression is
placed before the loop.
6. What are the properties of optimizing compiler?
The source code should be such that it should produce minimum amount of target
code.
There should not be any unreachable code.
Dead code should be completely removed from source language.
The optimizing compilers should apply following code improving transformations on
source language.
i) common subexpression elimination
ii) dead code elimination
iii) code movement
iv) strength reduction
7. What are the various ways to pass a parameter in a function?
Call by value
Call by reference
Copy-restore
Call by name
8. Suggest a suitable approach for computing hash function.
Using a hash function we should obtain the exact location of a name in the symbol table.
The hash function should result in a uniform distribution of names in the symbol table.
The hash function should be such that there will be a minimum number of collisions. A collision is a situation where the hash function results in the same location for storing two different names.
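One function commonly used for symbol tables with these properties is the PJW string hash (an example choice; the answer above does not prescribe a specific function):

```python
def hashpjw(name, table_size=211):
    """PJW string hash: shift in each character, folding the top nibble back in
    so that names spread uniformly over the table."""
    h = 0
    for ch in name:
        h = (h << 4) + ord(ch)
        g = h & 0xF0000000
        if g:                 # fold the high-order bits back into h
            h ^= g >> 24
            h &= ~g
    return h % table_size
```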
9. Define bottom up parsing?
It attempts to construct a parse tree for an input string beginning at the leaves and working up towards the root, i.e. reducing a string w to the start symbol of a grammar. At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production. It traces out a rightmost derivation in reverse and is also known as shift-reduce parsing.
10. What are the functions used to create the nodes of syntax trees?
Mknode (op, left, right)
Mkleaf (id,entry)
Mkleaf (num, val)
11. What are the functions for constructing syntax trees for expressions?
i) The construction of a syntax tree for an expression is similar to the translation
of the expression into postfix form.
ii) Each node in a syntax tree can be implemented as a record with several fields.
12. Give short note about call-by-name?
In call by name, at every reference to a formal parameter in a procedure body, the name of the corresponding actual parameter is evaluated. Access is then made to the effective parameter.
13. How parameters are passed to procedures in call-by-value method?
This mechanism transmits the values of the parameters of the call to the called program. The transfer is one way only, and therefore the only way a value can be returned is as the value of a function.
void print(int n)
{ printf("%d", n); }
int main()
{ print(5); return 0; }
14. Define static allocations and stack allocations
Static allocation lays out storage for all data objects at compile time. Names are bound to storage as the program is compiled, so there is no need for a run-time support package.
Stack allocation manages the run-time storage as a stack. It is based on the idea of a control stack: storage is organized as a stack, and activation records are pushed and popped as activations begin and end.
15. What are the difficulties with top down parsing?
a) Left recursion
b) Backtracking
c) The order in which alternates are tried can affect the language accepted.
d) When failure is reported, we have very little idea where the error actually occurred.
16. Define top down parsing?
It can be viewed as an attempt to find the left most derivation for an input string. It
can be viewed as attempting to construct a parse tree for the input starting from the root and
creating the nodes of the parse tree in preorder.
17. Define a syntax-directed translation?
Syntax-directed translation specifies the translation of a construct in terms of Attributes
associated with its syntactic components. Syntax-directed translation uses a context free
grammar to specify the syntactic structure of the input. It is an input- output mapping.
18. Define an attribute. Give the types of an attribute?
An attribute may represent any quantity, with each grammar symbol, it
associates a set of attributes and with each production, a set of semantic rules for
computing values of the attributes associated with the symbols appearing in that production.
Example: a type, a value, a memory location etc.,
i) Synthesized attributes.
ii) Inherited attributes.
19. Give the 2 attributes of syntax directed translation into 3-addr code?
i) E.place, the name that will hold the value of E and
ii) E.code , the sequence of 3-addr statements evaluating E.
20. Write the grammar for flow-of-control statements?
The following grammar generates the flow-of-control statements if-then, if-then-else, and while-do:
S -> if E then S1
| if E then S1 else S2
| while E do S1
PART B
1) Discuss in detail about the run time storage arrangement.
2) What are different storage allocation strategies. Explain.
3) Write in detail about the issues in the design of code generator.
4) Explain how declarations are done in a procedure using syntax directed translations.
5) What is a three address code? Mention its types. How would you implement these
address statements? Explain with suitable examples.
6) Write syntax directed translation for arrays.
UNIT V CODE OPTIMISATION AND CODE GENERATION
1. Define code generation.
Code generation is the final phase in the compiler model. It takes as input an intermediate representation of the source program and produces as output an equivalent target program. The intermediate instructions are each translated into a sequence of machine instructions that perform the same task.
2. What are the issues in the design of code generator?
Input to the generator
Target programs
Memory management
Instruction selection
Register allocation
Choice of evaluation order
Approaches to code generation.
3. Give the variety of forms in target program.
Absolute machine language.
Relocatable machine language.
Assembly language.
4. Give the factors of instruction selections.
Uniformity and completeness of the instruction sets
Instruction speed and machine idioms
Size of the instruction sets.
5. What are the sub problems in register allocation strategies?
During register allocation, we select the set of variables that will reside in register at a
point in the program.
During a subsequent register assignment phase, we pick the specific register that a variable will reside in.
6. Give the standard storage allocation strategies.
Static allocation
Stack allocation.
7. Define static allocations and stack allocations
Static allocation lays out storage for all data objects at compile time. Names are bound to storage as the program is compiled, so there is no need for a run-time support package.
Stack allocation manages the run-time storage as a stack. It is based on the idea of a control stack: storage is organized as a stack, and activation records are pushed and popped as activations begin and end.
8. Write the addressing mode and associated costs in the target machine.
MODE               FORM   ADDRESS                   ADDED COST
Absolute           M      M                         1
Register           R      R                         0
Indexed            c(R)   c+contents(R)             1
Indirect register  *R     contents(R)               0
Indirect indexed   *c(R)  contents(c+contents(R))   1
9. Define basic block and flow graph.
A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end.
A flow graph is obtained by adding flow-of-control information to the set of basic blocks making up a program, constructing a directed graph.
10. Write the step to partition a sequence of 3 address statements into basic blocks.
1. First determine the set of leaders, the first statement of basic blocks.
The rules we can use are the following.
The first statement is a leader.
Any statement that is the target of a conditional or unconditional goto is a leader.
Any statement that immediately follows a goto or conditional goto statement is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not including the next leader or the end of the program.
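The two steps above can be sketched in Python (the instruction encoding, with jump targets given as indices into the code list, is an assumption for illustration):

```python
def basic_blocks(code):
    """Partition a list of three-address instructions into basic blocks.
    Jumps are ('goto', target) or ('if', cond, target), target an index."""
    leaders = {0}                                # rule: the first statement
    for i, instr in enumerate(code):
        if instr[0] in ('goto', 'if'):
            leaders.add(instr[-1])               # rule: target of a jump
            if i + 1 < len(code):
                leaders.add(i + 1)               # rule: statement after a jump
    bounds = sorted(leaders) + [len(code)]
    return [code[a:b] for a, b in zip(bounds, bounds[1:])]
```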
11. Give the important classes of local transformations on basic blocks
Structure preserving transformations
Algebraic transformations.
12. Describe algebraic transformations.
Algebraic transformations can be used to change the set of expressions computed by a basic block into an algebraically equivalent set. The useful ones are those that simplify expressions or replace expensive operations by cheaper ones. For example, the following statements can be eliminated:
x := x + 0
x := x * 1
13. What is meant by register descriptors and address descriptors?
A register descriptor keeps track of what is currently in each register. It is consulted whenever a new register is needed.
An address descriptor keeps track of the location where the current value of a name can be found at run time. The location might be a register, a stack location, or a memory address.
14. What are the actions to perform the code generation algorithms?
Invoke a function getreg to determine the location L.
Consult the address descriptor for y to determine y', the current location of y.
If the current values of y and/or z have no next uses, are not live on exit from the block, and are in registers, alter the register descriptor.
15. Write the code sequence for d := (a - b) + (a - c) + (a - c).

Statement    Code generated    Register descriptor            Address descriptor
t := a - b   MOV a,R0          R0 contains t                  t in R0
             SUB b,R0
u := a - c   MOV a,R1          R0 contains t, R1 contains u   t in R0, u in R1
             SUB c,R1
v := t + u   ADD R1,R0         R0 contains v, R1 contains u   u in R1, v in R0
d := v + u   ADD R1,R0         R0 contains d                  d in R0
             MOV R0,d                                         d in R0 and memory
16. Write the labels on nodes in DAG.
A DAG for a basic block is a directed acyclic graph with the following Labels on nodes:
Leaves are labeled by unique identifiers, either variable names or constants.
Interior nodes are labeled by an operator symbol.
Nodes are also optionally given a sequence of identifiers for labels.
17. Give the applications of DAG.
Automatically detect the common sub expressions
Determine which identifiers have their values used in the block.
Determine which statements compute values that could be used outside the blocks.
18. Define Peephole optimization.
A statement-by-statement code generation strategy often produces target code that contains redundant instructions and suboptimal constructs. (The term "optimizing" is misleading because there is no guarantee that the resulting code is optimal.) Peephole optimization is a method for trying to improve the performance of the target program by examining a short sequence of target instructions and replacing these instructions by a shorter or faster sequence.
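Redundant-instruction elimination, the most common peephole transformation, can be sketched over pairs of instructions (the (op, src, dst) tuple form is an assumption for illustration):

```python
def eliminate_redundant_loads(instrs):
    """Peephole: drop  MOV a,R  when it immediately follows  MOV R,a , since
    a's value is already in register R (assumes the load carries no label)."""
    out = []
    for ins in instrs:
        if (out and ins[0] == 'MOV' and out[-1][0] == 'MOV'
                and out[-1][1] == ins[2] and out[-1][2] == ins[1]):
            continue                              # skip the redundant load
        out.append(ins)
    return out
```

The label caveat matters: if the second instruction can be reached by a jump, the pair cannot be collapsed.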
19. Write the characteristics of peephole optimization?
Redundant-instruction elimination
Flow-of-control optimizations.
Algebraic simplifications
Use of machine idioms
20. What are the structure preserving transformations on basic blocks?
Common sub-expression elimination
Dead-code elimination
Renaming of temporary variables
Interchange of two independent adjacent statements
21. Define Common sub-expression elimination with ex.
It is the process of eliminating a statement that recomputes an expression whose value has already been computed. Such a basic block may be transformed into an equivalent block.
Ex:
a := b + c
b := a - d
c := b + c
d := a - d
After elimination:
a := b + c
b := a - d
c := b + c
d := b
22. Define Dead-code elimination with ex.
Suppose the statement x := y + z appears in a basic block, where x is dead, that is, never subsequently used. Then this statement may be safely removed without changing the value of the basic block.
23. Define Renaming of temporary variables with ex.
Suppose we have a statement t := b + c, where t is a temporary. If we rewrite it as u := b + c, where u is a new temporary variable, and change all uses of this instance of t to u, then the value of the basic block is not changed.
24. Define reduction in strength with ex.
Reduction in strength replaces expensive operations by equivalent cheaper ones
on the target machines. Certain machine instructions are cheaper than others and can often
be used as special cases of more expensive operators. Ex:
x^2 is invariably cheaper to implement as x*x than as a call to an exponentiation routine.
25. Define use of machine idioms.
The target machine may have hardware instructions that implement certain specific operations efficiently. Detecting situations that permit the use of these instructions can reduce execution time significantly.
26. Define code optimization and optimizing compiler
The term code optimization refers to techniques a compiler can employ in an attempt to produce a better object-language program than the most obvious one for a given source program. Compilers that apply code-improving transformations are called optimizing compilers.
PART B:
1. What are the issues in the design of code generator? Explain in detail.
2. Discuss about the run time storage management.
3. Explain basic blocks and flow graphs.
4. Explain about transformation on a basic block.
5. Write a code generation algorithm. Explain about the descriptor and function getreg().Give
an example.
6. Explain peephole optimization
7. Explain DAG representation of basic blocks.
8. Explain principle sources of code optimization in details.
9. Explain the Source language issues with details.
10. Explain the Storage organization strategies with examples.
11. Explain storage allocation strategy.
12. Explain about Parameter passing.
13. Explain the non local names in runtime storage managements.
14. Explain about activation records and its purpose.
15. Explain about Optimization of basic blocks.
16. Explain the various approaches to compiler development.
17. Explain simple code generator with suitable example.
18. Discuss about the following:
a) Copy Propagation b) Dead-code Elimination and c) Code motion
UNIT I INTRODUCTION TO COMPILER
1. What is a Compiler?
A compiler is a program that reads a program written in one language - the source language - and translates it into an equivalent program in another language - the target language. As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.
2. State some software tools that manipulate source program?
i. Structure editors
ii. Pretty printers
iii. Static checkers
iv. Interpreters
3. What are the cousins of compiler? April/May 2004, April/May 2005
The following are the cousins of the compiler:
i. Preprocessors
ii. Assemblers
iii. Loaders
iv. Link editors
4. What are the two main parts of compilation? What do they perform?
The two main parts are:
Analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
Synthesis part constructs the desired target program from the intermediate representation.
5. What is a Structure editor?
A structure editor takes as input a sequence of commands to build a source program .The
structure editor not only performs the text creation and modification functions of an ordinary text
editor but it also analyzes the program text putting an appropriate hierarchical structure on the
source program.
6. What are a Pretty Printer and Static Checker?
A Pretty printer analyses a program and prints it in such a way that the structure of the
program becomes clearly visible.
A static checker reads a program, analyses it and attempts to discover potential bugs
without running the program.
7. How many phases does analysis consist of?
Analysis consists of three phases:
i. Linear analysis
ii. Hierarchical analysis
iii. Semantic analysis
8. What happens in linear analysis?
This is the phase in which the stream of characters making up the source program is read
from left to right and grouped into tokens that are sequences of characters having a collective
meaning.
9. What happens in Hierarchical analysis?
This is the phase in which characters or tokens are grouped hierarchically in to nested
collections with collective meaning.
10. What happens in Semantic analysis?
This is the phase in which certain checks are performed to ensure that the components of
a program fit together meaningfully.
11. State some compiler construction tools. April/May 2008
i. Parser generators
ii. Scanner generators
iii. Syntax-directed translation engines
iv. Automatic code generators
v. Data-flow engines
12. What is a Loader? What does the loading process do?
A Loader is a program that performs the two functions:
i. Loading
ii. Link editing
The process of loading consists of taking relocatable machine code, altering the relocatable
addresses and placing the altered instructions and data in memory at the proper locations.
13. What does Link Editing do?
Link editing: This allows us to make a single program from several files of relocatable
machine code. These files may have been the result of several compilations, and one or more
may be library files of routines provided by the system and available to any program that needs
them.
14. What is a preprocessor?
A preprocessor is one which produces input to compilers. A source program may be
divided into modules stored in separate files. The task of collecting the source program is
sometimes entrusted to a distinct program called a preprocessor.
The preprocessor may also expand macros into source language statements.
Skeletal source program -> Preprocessor -> Source program
15. State some functions of Preprocessors:
i) Macro processing
ii) File inclusion
iii) Rational preprocessors
iv) Language extensions
16. What is a Symbol table?
A Symbol table is a data structure containing a record for each identifier, with fields for
the attributes of the identifier. The data structure allows us to find the record for each identifier
quickly and to store or retrieve data from that record quickly.
17. State the general phases of a compiler:
i. Lexical analysis
ii. Syntax analysis
iii. Semantic analysis
iv. Intermediate code generation
v. Code optimization
vi. Code generation
18. What is an assembler?
An assembler is a program which converts assembly language into machine language.
19. What is the need for separating the analysis phase into lexical analysis and parsing?
(Or) What are the issues of a lexical analyzer?
Simpler design is perhaps the most important consideration. The separation of lexical
analysis from syntax analysis often allows us to simplify one or the other of these phases.
Compiler efficiency is improved.
Compiler portability is enhanced.
20. What is Lexical Analysis?
The first phase of the compiler is lexical analysis. This is also known as linear analysis, in
which the stream of characters making up the source program is read from left to right
and grouped into tokens that are sequences of characters having a collective meaning.
21. What is a lexeme? Define a regular set. Nov/Dec 2006
A Lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
A language denoted by a regular expression is said to be a regular set.
22. What is a sentinel? What is its usage? April/May 2004
A sentinel is a special character that cannot be part of the source program. Normally
we use 'eof' as the sentinel. This is used for speeding up the lexical analyzer.
23. What is a regular expression? State the rules, which define regular expression?
Regular expression is a method to describe a regular language.
Rules:
1) ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2) If a is a symbol in Σ, then a is a regular expression that denotes {a}.
3) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting L(r) U L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
24. What are the Error-recovery actions in a lexical analyzer?
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters
25. Construct a regular expression for the language L = {w ∈ {a,b}* | w ends in abb}.
Ans: (a|b)*abb
26. What is a recognizer?
Recognizers are machines which accept the strings belonging to a certain language. If the
valid strings of such a language are accepted by the machine, then the corresponding language is
said to be accepted by that machine; otherwise it is rejected.
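For illustration (my own sketch, not part of the question bank), the regular set (a|b)*abb from Q25 can be recognized by a deterministic finite automaton; the state numbering below is an arbitrary choice.

```python
# DFA recognizing (a|b)*abb: strings over {a,b} ending in "abb".
# States track the longest matched suffix: 0 = "", 1 = "a", 2 = "ab", 3 = "abb".
DELTA = {
    0: {'a': 1, 'b': 0},
    1: {'a': 1, 'b': 2},
    2: {'a': 1, 'b': 3},
    3: {'a': 1, 'b': 0},
}
ACCEPT = {3}

def recognize(w: str) -> bool:
    state = 0
    for ch in w:
        if ch not in ('a', 'b'):
            return False          # symbol outside the alphabet: reject
        state = DELTA[state][ch]
    return state in ACCEPT

print(recognize("babb"))   # True: ends in abb
print(recognize("abab"))   # False
```

The machine accepts exactly when it halts in state 3, i.e. when the last three input symbols were a, b, b.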
16 MARKS
1. PHASES OF COMPILER
A compiler operates in phases, each of which transforms the source program from one
representation into another. The following are the phases of the compiler:
Main phases:
1) Lexical analysis
2) Syntax analysis
3) Semantic analysis
4) Intermediate code generation
5) Code optimization
6) Code generation
Sub-phases:
1) Symbol table management
2) Error handling
LEXICAL ANALYSIS:
It is the first phase of the compiler. It gets input from the source program and produces
tokens as output. It reads the characters one by one, starting from left to right, and forms
the tokens.
Token: It represents a logically cohesive sequence of characters such as keywords,
operators, identifiers, special symbols etc.
o Example: a + b = 20. Here, a, b, +, =, 20 are all separate tokens.
o The group of characters forming a token is called the lexeme.
The lexical analyser not only generates a token but also enters the lexeme into the symbol
table if it is not already there.
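The character-by-character grouping described above can be sketched as a minimal scanner in Python; the token names (ID, NUMBER, OP) are illustrative choices of this sketch, not prescribed by the text.

```python
def tokens(source):
    """Read characters one by one, left to right, grouping them into tokens."""
    toks, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                      # skip blanks
            i += 1
        elif ch.isdigit():                    # group consecutive digits into one NUMBER
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            toks.append(("NUMBER", source[i:j]))
            i = j
        elif ch.isalpha():                    # group letters/digits into one ID lexeme
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            toks.append(("ID", source[i:j]))
            i = j
        else:                                 # single-character operator or symbol
            toks.append(("OP", ch))
            i += 1
    return toks

print(tokens("a + b = 20"))
# [('ID', 'a'), ('OP', '+'), ('ID', 'b'), ('OP', '='), ('NUMBER', '20')]
```

Each (name, lexeme) pair corresponds to one token of the a + b = 20 example above.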
SYNTAX ANALYSIS:
It is the second phase of the compiler. It is also known as the parser. It gets the token
stream as input from the lexical analyser of the compiler and generates the syntax tree as
output.
Syntax tree: It is a tree in which interior nodes are operators and exterior nodes are
operands.
Example: For a = b + c * 2, the syntax tree is:

        =
       / \
      a   +
         / \
        b   *
           / \
          c   2
SEMANTIC ANALYSIS:
It is the third phase of the compiler. It gets its input from the syntax analysis as a parse
tree and checks whether the given syntax is correct or not. It performs type conversion of
all the data types into real data types.
INTERMEDIATE CODE GENERATION:
It is the fourth phase of the compiler. It gets input from the semantic analysis and
converts the input into intermediate code such as three-address code.
The three-address code consists of a sequence of instructions, each of which has at most
three operands.
Example: t1 = t2 + t3
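A minimal sketch of how three-address instructions such as t1 = t2 + t3 can be emitted from an expression tree; the tuple encoding of the tree and the temporary-naming scheme are assumptions of this sketch.

```python
import itertools

_temp_ids = itertools.count(1)

def gen_tac(node, code):
    """Return the name holding node's value; append three-address instructions to code."""
    if isinstance(node, str):                 # leaf: a variable name or constant
        return node
    op, left, right = node                    # interior node: operator with two children
    l = gen_tac(left, code)
    r = gen_tac(right, code)
    temp = f"t{next(_temp_ids)}"
    code.append(f"{temp} = {l} {op} {r}")     # one instruction, at most three operands
    return temp

code = []
gen_tac(("+", "a", ("*", "c", "2")), code)    # the expression a + c * 2
print(code)   # ['t1 = c * 2', 't2 = a + t1']
```

Each operator node yields one instruction whose result is held in a fresh temporary.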
CODE OPTIMIZATION:
It is the fifth phase of the compiler. It gets the intermediate code as input and produces
optimized intermediate code as output.
This phase reduces the redundant code and attempts to improve the intermediate code so
that faster-running machine code will result. During code optimization, the result of the
program is not affected.
To improve the code generation, the optimization involves:
- deduction and removal of dead code (unreachable code).
- calculation of constants in expressions and terms.
- collapsing of repeated expressions into a temporary string.
- loop unrolling.
- moving code outside the loop.
- removal of unwanted temporary variables.
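Two of the transformations listed above, calculation of constants in expressions (constant folding) and removal of dead code, can be sketched on a tiny three-address form; the 4-tuple instruction format and the restriction to + and * are assumptions of this sketch.

```python
# Each instruction: (dest, arg1, op, arg2); operands are ints or variable names.
def fold_constants(instrs):
    """Replace x = c1 op c2 with a computed constant and propagate it (folding)."""
    out, env = [], {}
    for dest, a1, op, a2 in instrs:
        a1 = env.get(a1, a1)                      # substitute known constants
        a2 = env.get(a2, a2)
        if isinstance(a1, int) and isinstance(a2, int):
            env[dest] = a1 + a2 if op == "+" else a1 * a2   # only + and * handled
        else:
            out.append((dest, a1, op, a2))
    return out, env

def remove_dead(instrs, live):
    """Drop instructions whose destination is never used afterwards (dead code)."""
    out = []
    for dest, a1, op, a2 in reversed(instrs):
        if dest in live:
            out.append((dest, a1, op, a2))
            live |= {a1, a2}                      # its operands become live too
        # else: result unused, instruction dropped
    return list(reversed(out))

prog = [("t1", 2, "*", 3), ("t2", "x", "+", "t1"), ("t3", "x", "+", 1)]
folded, env = fold_constants(prog)
print(folded, env)                    # t1 folded to 6 and propagated into t2
print(remove_dead(folded, {"t2"}))    # t3 is dead if only t2 is live at the end
```

Note that, as the text requires, neither pass changes the result of the program for the live values.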
CODE GENERATION:
It is the final phase of the compiler. It gets input from the code optimization phase and
produces the target code or object code as result.
Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
The code generation involves:
- allocation of registers and memory.
- generation of correct references.
- generation of correct data types.
- generation of missing code.
SYMBOL TABLE MANAGEMENT:
The symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes of
the identifier.
It allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
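A minimal dictionary-based sketch of such a symbol table in Python; the attribute fields used here (type, line, width) are illustrative, not prescribed by the text.

```python
class SymbolTable:
    """One record per identifier, with fields for its attributes."""
    def __init__(self):
        self._records = {}                 # identifier name -> attribute dict

    def enter(self, name, **attrs):
        # Any phase that detects an identifier stores or extends its record here.
        self._records.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        # Quick retrieval of the record for an identifier (None if absent).
        return self._records.get(name)

table = SymbolTable()
table.enter("rate", type="real", line=1)   # entered by the lexical analyzer
table.enter("rate", width=8)               # attribute added by a later phase
print(table.lookup("rate"))   # {'type': 'real', 'line': 1, 'width': 8}
```

The dictionary gives the fast store/retrieve access the text asks of a symbol table; a production compiler would typically add scope handling on top.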
ERROR HANDLING:
Each phase can encounter errors. After detecting an error, a phase must handle the error
so that compilation can proceed.
In lexical analysis, errors occur in the separation of tokens.
In syntax analysis, errors occur during the construction of the syntax tree.
In semantic analysis, errors occur when the compiler detects constructs with right
syntactic structure but no meaning, and during type conversion.
In code optimization, errors occur when the result is affected by the optimization.
In code generation, errors show up when code is missing etc.
2. COMPILER CONSTRUCTION TOOLS
These are specialized tools that have been developed for helping implement various
phases of a compiler. The following are the compiler construction tools:
1) Parser Generators:
- These produce syntax analyzers, normally from input that is based on a context-free
grammar.
- Syntax analysis consumes a large fraction of the running time of a compiler.
- Example: YACC (Yet Another Compiler-Compiler).
2) Scanner Generators:
- These generate lexical analyzers, normally from a specification based on regular
expressions.
- The basic organization of lexical analyzers is based on finite automata.
3) Syntax-Directed Translation Engines:
- These produce routines that walk the parse tree and as a result generate intermediate
code.
- Each translation is defined in terms of translations at its neighbor nodes in the tree.
4) Automatic Code Generators:
- These take a collection of rules to translate intermediate language into machine
language. The rules must include sufficient detail to handle different possible access
methods for data.
5) Data-Flow Engines:
- These do code optimization using data-flow analysis, that is, the gathering of
information about how values are transmitted from one part of a program to each other
part.
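In the spirit of the scanner generators described above, a lexical analyzer can be generated from a regular-expression specification. This is a simplified sketch loosely modeled on lex-style tools; the sample token specification is hypothetical.

```python
import re

def make_scanner(spec):
    """Build a scanning function from (token-name, regex) pairs, as a scanner
    generator would from a regular-expression specification."""
    master = re.compile("|".join(f"(?P<{n}>{p})" for n, p in spec))
    def scan(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = master.match(text, pos)
            if not m:
                raise SyntaxError(f"illegal character {text[pos]!r} at {pos}")
            if m.lastgroup != "WS":                 # whitespace is discarded
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens
    return scan

# "Generating" a scanner from a tiny hypothetical specification:
scan = make_scanner([("NUM", r"\d+"), ("ID", r"[a-z]+"), ("WS", r"\s+")])
print(scan("x 42"))   # [('ID', 'x'), ('NUM', '42')]
```

Real generators (lex, flex) compile the same kind of specification into a finite automaton rather than relying on a backtracking regex engine, but the interface is analogous.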
Shri Vishnu Engineering College For Women
Department of CSE
COMPILER DESIGN
LECTURE NOTES
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SHRI VISHNU ENGINEERING COLLEGE FOR WOMEN
(Approved by AICTE, Accredited by NBA, Affiliated to JNTU Kakinada)
BHIMAVARAM 534 202
www.rejinpaul.com
Download Useful Materials from Rejinpaul.com
UNIT -1
1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM
1.2 Preprocessor
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are short
hands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessors: These preprocessors augment older languages with more
modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language
by means of built-in macros.
1.3 COMPILER
A compiler is a translator program that reads a program written in a high-level language
(HLL), the source program, and translates it into an equivalent program in machine-level
language (MLL), the target program. An important part of a compiler is showing errors to the
programmer.
source pgm -> Compiler -> target pgm
(error msg)
Executing a program written in an HLL programming language basically consists of two
parts: the source program must first be compiled (translated) into an object program; then the
resulting object program is loaded into memory and executed.
source pgm -> Compiler -> obj pgm
input -> obj pgm -> output
1.4 ASSEMBLER: Programmers found it difficult to write or read programs in machine
language. They began to use mnemonics (symbols) for each machine instruction, which
they would subsequently translate into machine language. Such a mnemonic machine
language is now called an assembly language. Programs known as assemblers were
written to automate the translation of assembly language into machine language. The
input to an assembler program is called the source program; the output is a machine language
translation (object program).
1.5 INTERPRETER: An interpreter is a program that appears to execute a source
program as if it were machine language.
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. Java also
uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
Modifications to the user program can easily be made and implemented as execution
proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simpler task for a program used for
interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
2 Loader and Link-editor:
Once the assembler produces an object program, that program must be placed into
memory and executed. The assembler could place the object program directly in memory
and transfer control to it, thereby causing the machine language program to be
executed. This would waste core by leaving the assembler in memory while the user's
program was being executed. Also, the programmer would have to retranslate his program
with each execution, thus wasting translation time. To overcome these problems of wasted
translation time and memory, system programmers developed another component called a
loader.
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into object form that the
loader could "relocate" directly behind the user's program. The task of adjusting programs so
they may be placed in arbitrary core locations is called relocation. Relocating loaders
perform four functions.
1.6 TRANSLATOR
A translator is a program that takes as input a program written in one language and
produces as output a program in another language. Besides program translation, the translator
performs another very important role: error detection. Any violation of the HLL
specification would be detected and reported to the programmer. The important roles of a
translator are:
1. Translating the HLL program input into an equivalent ML program.
2. Providing diagnostic messages wherever the programmer violates the specification of
the HLL.
1.7 TYPE OF TRANSLATORS:-
INTERPRETER
COMPILER
PREPROCESSOR
1.8 LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. D compilers
9. Common Lisp compilers
10. ECMAScript interpreters
11. Eiffel compilers
12. Felix compilers
13. Fortran compilers
14. Haskell compilers
15. Java compilers
16. Pascal compilers
17. PL/I compilers
18. Python compilers
19. Scheme compilers
20. Smalltalk compilers
21. CIL compilers
1.9 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated
operation that takes a source program in one representation and produces output in another
representation. The phases of a compiler are shown below.
There are two parts of compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
LA or scanner reads the source program one character at a time, carving the
source program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this
phase expressions, statements, declarations etc… are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on formal grammar of the
programming language.
Intermediate Code Generations:-
An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
Code Optimization :-
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of machine language program are carried out during this phase. The
output of the code generator is the machine language program of the specified computer.
Table Management (or) Book-keeping:-
This is the portion that keeps the names used by the program and records
essential information about each. The data structure used to record this information is called a
'Symbol Table'.
Error Handlers:-
It is invoked when a flaw (error) in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the
syntax analyzer or parser. The SA groups the tokens together into syntactic structures such
as expressions. Expressions may further be combined to form statements. The syntactic
structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer
occur in patterns that are permitted by the specification for the source language. It also
imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /,
the syntax analyzer should detect an error situation, because the presence of these two
adjacent binary operators violates the formation rules of an expression.
The job of syntax analysis is to make explicit the hierarchical structure of the incoming
token stream by identifying which parts of the token stream should be grouped together.
Example: A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
Intermediate Code Generation:-
The intermediate code generation uses the structure produced by the syntax
analyzer to create a stream of simple instructions. Many styles of intermediate code are
possible. One common style uses instruction with one operator and a small number of
operands.
The output of the syntax analyzer is some representation of a parse tree. The
intermediate code generation phase transforms this parse tree into an intermediate-language
representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space. Its output is another intermediate-code program that
does the same job as the original, but in a way that saves time and/or space.
1, Local Optimization:-
There are local transformations that can be applied to a program to
make an improvement. For example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common
sub-expressions
A := B + C + D
E := B + C + F
Might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
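The elimination of the common sub-expression B + C above can be sketched as a small pass over assignment statements. The statement encoding is an assumption, and the pass handles only identical binary right-hand sides, a simplification of the three-operand example in the text.

```python
import itertools

def eliminate_cse(stmts):
    """Rewrite assignments so that a repeated right-hand side (a, op, b)
    is computed once into a temporary and then reused."""
    seen, out = {}, []                     # (arg1, op, arg2) -> name holding it
    temps = itertools.count(1)
    for dest, rhs in stmts:
        if rhs in seen:
            out.append((dest, seen[rhs]))  # reuse the earlier result
        else:
            t = f"T{next(temps)}"
            out.append((t, rhs))           # compute the expression once
            out.append((dest, t))
            seen[rhs] = t
    return out

# A := B + C ; E := B + C  becomes  T1 := B + C ; A := T1 ; E := T1
optimized = eliminate_cse([("A", ("B", "+", "C")), ("E", ("B", "+", "C"))])
print(optimized)
```

A real optimizer would also check that B and C are not reassigned between the two uses before reusing T1.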
2, Loop Optimization:-
Another important source of optimization concerns about increasing
the speed of loops. A typical loop improvement is to move a
computation that produces the same result each time around the loop
to a point, in the program just before the loop is entered.
Code generator :-
The code generator produces the object code by deciding on the memory locations for data,
selecting code to access each datum and selecting the registers in which each computation is
to be done. Many computers have only a few high speed registers in which computations can
be performed quickly. A good code generator would attempt to utilize registers as efficiently
as possible.
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear
in the source program. The information about data objects is collected by the early phases of
the compiler-lexical and syntactic analyzers. The data structure used to record this
information is called as Symbol Table.
Error Handling :-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the programmer to
determine exactly where the errors have occurred. Errors may occur in any of the phases of a
compiler.
Whenever a phase of the compiler discovers an error, it must report the error to
the error handler, which issues an appropriate diagnostic msg. Both of the table-management
and error-Handling routines interact with all phases of the compiler.
Example:
position := initial + rate * 60

Lexical Analyzer:
id1 := id2 + id3 * id4

Syntax Analyzer:
        :=
       /  \
     id1   +
          / \
       id2   *
            / \
         id3   60

Semantic Analyzer (inserts the int-to-real conversion):
        :=
       /  \
     id1   +
          / \
       id2   *
            / \
         id3  inttoreal(60)

Intermediate Code Generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1

Code Generator:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
1.10 TOKEN
LA reads the source program one character at a time, carving the source program into
a sequence of atomic units called 'tokens'. A token has two parts:
1. Type of the token.
2. Value of the token.
Type: variable, operator, keyword, constant.
Value: name of the variable, current value, (or) pointer to the symbol table.
If the symbols are given in the standard format, the LA accepts them and produces
tokens as output. Each token is a sub-string of the program that is to be treated as a single
unit. Tokens are of two types:
1. Specific strings such as IF (or) semicolon.
2. Classes of strings such as identifiers, labels, constants.
UNIT -2
LEXICAL ANALYSIS
2.1 OVER VIEW OF LEXICAL ANALYSIS
o To identify the tokens we need some method of describing the possible tokens
that can appear in the input stream. For this purpose we introduce regular expression, a
notation that can be used to describe essentially all the tokens of programming
language.
o Secondly , having decided what the tokens are, we need some mechanism to
recognize these in the input stream. This is done by the token recognizers, which are
designed using transition diagrams and finite automata.
2.2 ROLE OF LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters
and produce as output a sequence of tokens that the parser uses for syntax analysis.
Upon receiving a 'get next token' command from the parser, the lexical analyzer
reads input characters until it can identify the next token. The LA returns to the parser a
representation of the token it has found. The representation will be an integer code if the
token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank,
tab and newline characters. Another is correlating error messages from the compiler with the
source program.
2.3 LEXICAL ANALYSIS VS PARSING:

Lexical analysis:
- A scanner simply turns an input string (say a file) into a list of tokens. These tokens
represent things like identifiers, parentheses, operators etc.
- The lexical analyzer (the "lexer") parses individual symbols from the source code file
into tokens. From there, the "parser" proper turns those whole tokens into sentences of
your grammar.

Parsing:
- A parser converts this list of tokens into a tree-like object to represent how the tokens
fit together to form a cohesive whole (sometimes referred to as a sentence).
- A parser does not give the nodes any meaning beyond structural cohesion. The
next thing to do is extract meaning from this structure (sometimes called contextual
analysis).
2.4 TOKEN, LEXEME, PATTERN:
Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5)constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Example:
Description of tokens:

Token    | Lexeme              | Pattern
const    | const               | const
if       | if                  | if
relation | <, <=, =, <>, >=, > | < or <= or = or <> or >= or >
id       | pi                  | letter followed by letters & digits
num      | 3.14                | any numeric constant
literal  | "core"              | any character b/w " and " except "
A pattern is a rule describing the set of lexemes that can represent a particular token in the
source program.
2.5 LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means
that there is no way to recognise a lexeme as a valid token for the lexer. Syntax errors, on the
other hand, will be thrown by the parser when a given set of already recognised valid
tokens does not match any of the right sides of the grammar rules. A simple panic-mode error
handling system requires that we return to a high-level parsing function when a parsing or
lexical error is detected.
Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
2.6 DIFFERENCE BETWEEN COMPILER AND INTERPRETER
A compiler converts the high-level instructions into machine language, while an
interpreter converts the high-level instructions into an intermediate form.
Before execution, the entire program is translated by the compiler, whereas an
interpreter translates the first line, executes it, and so on.
A list of errors is created by the compiler after the compilation process, while an
interpreter stops translating after the first error.
An independent executable file is created by the compiler, whereas an interpreter is
required by an interpreted program each time it is run.
The compiler produces object code, whereas an interpreter does not produce object code.
In the process of compilation the program is analyzed only once and then the code is
generated, whereas the source program is interpreted every time it is to be executed and
every time the source program is analyzed; hence an interpreter is less efficient than a
compiler.
Example of an interpreter: the UPS debugger is basically a graphical source-level
debugger, but it contains a built-in C interpreter which can handle multiple source files.
Example of a compiler: the Borland C compiler or Turbo C compiler compiles programs
written in C or C++.
2.7 REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings.
Components of a regular expression:
x        the character x
.        any character, usually except a new line
[x y z]  any of the characters x, y, z, ...
R?       an R or nothing (= optionally an R)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view
the set of strings in each token class as a language, we can use the regular-expression
notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters
or digits. In regular expression notation we would write:
identifier = letter (letter | digit)*
Here are the rules that define regular expressions over an alphabet ∑:
o ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
o For each ‘a’ in ∑, a is a regular expression denoting { a }, the language with only one string, consisting of the single symbol ‘a’.
o If R and S are regular expressions denoting the languages L(R) and L(S), then
(R) | (S) is a regular expression denoting L(R) U L(S)
(R)(S) is a regular expression denoting L(R)L(S)
(R)* is a regular expression denoting L(R)*
2.8 REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter - A | B | ... | Z | a | b | ... | z
digit  - 0 | 1 | 2 | ... | 9
id     - letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
“equals” and <> is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described
using regular definitions:
digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the “token” ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when we
recognize it, we do not return it to the parser, but rather restart the lexical analysis from the
character that follows the white space. It is the following token that gets returned to the
parser.
Lexeme       Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           pointer to table entry
Any number   number       pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
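The regular definitions above, together with the ws convention, can be sketched as a small scanner. The fragment below uses Python's re with named groups; the token names match the table, but the helper itself and its error handling (unknown characters are simply skipped) are illustrative only:

```python
import re

# One alternative per token; keywords come before id so "if" is not an id.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),
    ("if",     r"if\b"),
    ("then",   r"then\b"),
    ("else",   r"else\b"),
    ("relop",  r"<=|>=|<>|<|>|="),            # longest alternatives first
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokens(src):
    """Yield (token_name, lexeme) pairs; ws is recognized but not returned."""
    for m in MASTER.finditer(src):
        if m.lastgroup != "ws":
            yield (m.lastgroup, m.group())

print(list(tokens("if x <= 42 then y")))
```

As the text describes, ws is matched like any other pattern but never returned; the parser only ever sees the token that follows the white space.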
2.9 TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking for a
lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled
by a symbol or set of symbols.
If we are in a state s and the next input symbol is a, we look for an edge out of state s
labeled by a. If we find such an edge, we advance the forward pointer and enter the
state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge labeled
“start” entering from nowhere. The transition diagram always begins in the start state before
any input symbols have been read.
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized
flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles
and are called states.
The above transition diagram is for an identifier, defined to be a letter followed by any
number of letters or digits. A sequence of transition diagrams can be converted into a
program to look for the tokens specified by the diagrams. Each state gets a segment of code.
if    = if
then  = then
else  = else
relop = < | <= | = | <> | > | >=
id    = letter (letter | digit)*
num   = digit+ (. digit+)? (E (+ | -)? digit+)?
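The "each state gets a segment of code" idea can be shown for the identifier diagram; the state numbering below is ours, and a real scanner would retract the forward pointer and accept rather than simply return:

```python
def recognize_id(s):
    """Simulate the identifier diagram: start --letter--> accepting loop."""
    state = 0                      # state 0: start state
    for ch in s:
        if state == 0:
            if ch.isalpha():
                state = 1          # saw the leading letter
            else:
                return False
        else:                      # state 1: accepting, loops on letter|digit
            if not ch.isalnum():
                return False       # real scanner: retract forward and accept
    return state == 1
```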
2.10 AUTOMATA
An automaton is defined as a system where information is transmitted and used for
performing some functions without the direct participation of man.
1. An automaton in which the output depends only on the input is called an
automaton without memory.
2. An automaton in which the output depends on the input and the state is called an
automaton with memory.
3. An automaton in which the output depends only on the state of the machine is
called a Moore machine.
4. An automaton in which the output depends on the state and the input at any instant of
time is called a Mealy machine.
2.11 DESCRIPTION OF AUTOMATA
1. An automaton has a mechanism to read input from an input tape.
2. Any language is recognized by some automaton; hence these automata are
basically language ‘acceptors’ or ‘language recognizers’.
Types of Finite Automata
Deterministic Automata
Non-Deterministic Automata
2.12 DETERMINISTIC AUTOMATA
A deterministic finite automaton has at most one transition from each state on any
input. A DFA is a special case of an NFA in which:
1. it has no transitions on input ε,
2. each input symbol has at most one transition from any state.
A DFA is formally defined by the 5-tuple notation M = (Q, ∑, δ, q0, F), where
Q is a finite, non-empty ‘set of states’,
∑ is the ‘input alphabet’, indicating the input set,
q0 is the ‘initial state’ and q0 is in Q,
F is the set of ‘final states’, a subset of Q, and
δ is the ‘transition function’ or mapping function; using this function the
next state can be determined.
The regular expression is converted into a minimized DFA by the following procedure:
Regular expression -> NFA -> DFA -> Minimized DFA
The finite automaton is called a DFA if there is only one path for a specific input from
the current state to the next state.
From state S0 for input a there is only one path, going to S2. Similarly, from S0 there
is only one path for input b, going to S1.
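As a concrete instance of the 5-tuple, here is a DFA for the language (a|b)*abb (the same language whose NFA appears in the next section); the transition table is an illustrative hand construction, simulated directly:

```python
# DFA for (a|b)*abb: Q = {0,1,2,3}, alphabet {a,b}, start q0 = 0, F = {3}
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
q0, F = 0, {3}

def accepts(w):
    """Run the transition function on w; accept if we end in a final state."""
    state = q0
    for ch in w:
        state = delta[(state, ch)]
    return state in F
```

Because δ is a function (one entry per state/symbol pair), there is exactly one path through the machine for any input, which is what makes it deterministic.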
2.13 NONDETERMINISTIC AUTOMATA
An NFA is a mathematical model that consists of:
A set of states S.
A set of input symbols ∑.
A transition function, move, that maps state-symbol pairs to sets of states.
A state s0 that is distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
There may be a number of transitions from a state on a single symbol.
An NFA can be diagrammatically represented by a labeled directed graph, called a
transition graph, in which the nodes are the states and the labeled edges represent
the transition function.
This graph looks like a transition diagram, but the same character can label two or
more transitions out of one state, and edges can be labeled by the special symbol ε
as well as by input symbols.
The transition graph for an NFA that recognizes the language (a|b)*abb is
shown below.
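In the usual NFA for (a|b)*abb, state 0 loops on both a and b and also guesses, on a, that the final abb has begun; nondeterminism shows up in simulation as a set of current states. A minimal sketch (state numbering is ours):

```python
# NFA for (a|b)*abb: 0 loops on a,b; 0 -a-> 1, 1 -b-> 2, 2 -b-> 3 (final)
nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
start, accept = 0, 3

def nfa_accepts(w):
    """Track the set of all states the NFA could be in after each symbol."""
    states = {start}
    for ch in w:
        states = set().union(*(nfa.get((s, ch), set()) for s in states))
    return accept in states
```

This "track all possible states" simulation is exactly the idea behind the subset construction that converts an NFA to a DFA.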
2.14 DEFINITION OF CFG
A context-free grammar involves four quantities:
a CFG contains terminals, nonterminals, a start symbol and productions.
Terminals are the basic symbols from which strings are formed.
Nonterminals are syntactic variables that denote sets of strings.
In a grammar, one nonterminal is distinguished as the start symbol, and the set of
strings it denotes is the language defined by the grammar.
The productions of the grammar specify the manner in which the terminals and
nonterminals can be combined to form strings.
Each production consists of a nonterminal, followed by an arrow, followed by a string
of nonterminals and terminals.
2.15 DEFINITION OF SYMBOL TABLE
A symbol table is an extensible array of records.
The identifier and the associated records contain collected information about
the identifier.
FUNCTION identify (identifier name)
RETURNING a pointer to identifier information, which contains:
the actual string
a macro definition
a keyword definition
a list of type, variable and function definitions
a list of structure and union name definitions
a list of structure and union field selector definitions
2.16 Creating a lexical analyzer with Lex
2.17 Lex specifications:
A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest
constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14),
and regular definitions.
2. The translation rules of a Lex program are statements of the form :
p1 {action 1}
p2 {action 2}
p3 {action 3}
… …
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
2.18 INPUT BUFFERING
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Because a large amount of time can be consumed scanning characters, specialized
buffering techniques have been developed to reduce the amount of overhead required to
process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for
the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided
into two halves of, say, 100 characters each. One pointer marks the beginning of the token
being discovered. A lookahead pointer scans ahead of the beginning point, until the token is
discovered. We view the position of each pointer as being between the character last read and
the character next to be read. In practice each buffering scheme adopts one convention: either
a pointer is at the symbol last read, or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be
large. For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2… ARG n)
without knowing whether DECLARE is a keyword or an array name until we see the
character that follows the right parenthesis. In either case, the token itself ends at the second
E. If the lookahead pointer travels beyond the buffer half in which it began, the other half
must be loaded with the next characters from the source file.
Since the buffer shown in the figure above is of limited size, there is an implied constraint on
how much lookahead can be used before the next token is discovered. In the above example,
if the lookahead traveled to the left half and all the way through the left half to the middle,
we could not reload the right half, because we would lose characters that had not yet been
grouped into tokens. While we can make the buffer larger if we choose, or use another
buffering scheme, we cannot ignore the fact that lookahead is limited.
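The sentinel technique can be shown in miniature: append a character that cannot occur in the source, so the inner scanning loop needs only one test per character instead of a separate end-of-buffer check at every step. The single-half buffer and the helper below are illustrative only:

```python
# Sentinel sketch: '\0' cannot appear in the source text, so the scanning
# loop's single character test doubles as the end-of-buffer test.
SENTINEL = "\0"

def scan_word(text):
    """Advance a 'forward' pointer over letters, guarded only by the sentinel."""
    buf = text + SENTINEL           # buffer half with the sentinel at its end
    forward = 0
    while buf[forward].isalpha():   # no separate end-of-buffer check needed
        forward += 1
    return buf[:forward]            # the lexeme between begin and forward
```

In a real buffer-pair scheme each half gets its own sentinel, and hitting a sentinel triggers a reload of the other half rather than the end of scanning.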
UNIT -3
SYNTAX ANALYSIS
3.1 ROLE OF THE PARSER
The parser obtains a string of tokens from the lexical analyzer and verifies that it can be
generated by the grammar for the source language. The parser should report any syntax
errors in an intelligible fashion. The two types of parsers employed are:
1. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves).
2. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
Therefore there are two types of parsing methods: top-down parsing and bottom-up parsing.
3.2 TOP-DOWN PARSING
A program that performs syntax analysis is called a parser. A syntax analyzer takes tokens as
input and outputs error messages if the program syntax is wrong. The parser uses single-
symbol lookahead and an approach called top-down parsing without backtracking. Top-down
parsers check to see if a string can be generated by a grammar by creating a parse tree starting
from the initial symbol and working down. Bottom-up parsers, however, check to see if a
string can be generated from a grammar by creating a parse tree from the leaves and working
up. Early parser generators such as YACC create bottom-up parsers, whereas many Java
parser generators such as JavaCC create top-down parsers.
3.3 RECURSIVE DESCENT PARSING
Typically, top-down parsers are implemented as a set of recursive functions that descend
through a parse tree for a string. This approach is known as recursive descent parsing, also
known as LL(k) parsing, where the first L stands for left-to-right, the second L stands for
leftmost-derivation, and k indicates k-symbol lookahead. Therefore, a parser using the single-
symbol lookahead method and top-down parsing without backtracking is called an LL(1)
parser. In the following sections, we will also use an extended BNF notation in which some
regular-expression operators are incorporated.
In extended BNF, a syntax of the form A ::= B | C defines sentences of the form B or of the
form C. A syntax of the form A ::= B C defines sentences that consist of a sentence of the
form B followed by a sentence of the form C. A syntax of the form [B] defines zero or one
occurrence of the form B. A syntax of the form {B} defines zero or more occurrences of the
form B.
A usual implementation of an LL(1) parser is:
o initialize its data structures,
o get the lookahead token by calling scanner routines, and
o call the routine that implements the start symbol.
Here is an example.
proc syntaxAnalysis()
begin
initialize(); // initialize global data and structures
nextToken(); // get the lookahead token
program(); // parser routine that implements the start symbol
end;
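Filling in that skeleton for the classic expression grammar used later in this unit (E -> T E', E' -> + T E' | e, T -> F T', T' -> * F T' | e, F -> (E) | id) gives a complete recursive-descent sketch. Token handling is simplified here to a list of strings, with "$" as the end marker:

```python
def parse(tokens):
    """Recursive-descent parse; one function per nonterminal."""
    toks = list(tokens) + ["$"]
    pos = 0

    def look():
        return toks[pos]

    def match(t):
        nonlocal pos
        if look() != t:
            raise SyntaxError(f"expected {t}, got {look()}")
        pos += 1

    def E():                                # E -> T E'
        T(); Eprime()

    def Eprime():                           # E' -> + T E' | e
        if look() == "+":
            match("+"); T(); Eprime()
        # otherwise take E' -> e: consume nothing

    def T():                                # T -> F T'
        F(); Tprime()

    def Tprime():                           # T' -> * F T' | e
        if look() == "*":
            match("*"); F(); Tprime()

    def F():                                # F -> (E) | id
        if look() == "(":
            match("("); E(); match(")")
        else:
            match("id")

    E()
    match("$")
    return True
```

Each epsilon production simply falls through without consuming input, which is what makes the single lookahead symbol sufficient for this grammar.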
3.4 FIRST AND FOLLOW
To compute FIRST(X) for all grammar symbols X, apply the following rules until
no more terminals or e can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}.
2. If X -> e is a production, then add e to FIRST(X).
3. If X is a nonterminal and X -> Y1Y2...Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and e is in all of FIRST(Y1),...,FIRST(Yi-1); that is,
Y1...Yi-1 =*> e. If e is in FIRST(Yj) for all j = 1,2,...,k, then add e to FIRST(X). For
example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive e, then we
add nothing more to FIRST(X), but if Y1 =*> e, then we add FIRST(Y2), and so on.
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing
can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A -> αBβ, then everything in FIRST(β) except e is placed in
FOLLOW(B).
3. If there is a production A -> αB, or a production A -> αBβ where FIRST(β) contains e,
then everything in FOLLOW(A) is in FOLLOW(B).
Consider the following example to understand the concept of FIRST and FOLLOW. Find the
FIRST and FOLLOW of all nonterminals in the grammar:
E -> TE'
E'-> +TE'|e
T -> FT'
T'-> *FT'|e
F -> (E)|id
Then:
FIRST(E)=FIRST(T)=FIRST(F)={(,id}
FIRST(E')={+,e}
FIRST(T')={*,e}
FOLLOW(E)=FOLLOW(E')={),$}
FOLLOW(T)=FOLLOW(T')={+,),$}
FOLLOW(F)={+,*,),$}
For example, id and left parenthesis are added to FIRST(F) by rule 3 in the definition of
FIRST with i=1 in each case, since FIRST(id) = {id} and FIRST('(') = {(} by rule 1. Then by
rule 3 with i=1, the production T -> FT' implies that id and left parenthesis belong to
FIRST(T) also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied
to production F -> (E), right parenthesis is also in FOLLOW(E). By rule 3 applied to
production E -> TE', $ and right parenthesis are in FOLLOW(E').
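The two fixed-point computations above can be sketched compactly. In the fragment below (names and representation are ours) 'e' stands for the empty string, and the grammar is the one just given:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["e"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["e"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NT = set(GRAMMAR)

def first_of(seq, FIRST):
    """FIRST of a sentential form; 'e' marks the empty string."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in NT else {X}
        out |= f - {"e"}
        if "e" not in f:
            return out
    out.add("e")          # every symbol could derive e
    return out

def compute_first():
    FIRST = {A: set() for A in NT}
    changed = True
    while changed:        # iterate until no FIRST set grows
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                f = first_of(rhs, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(FIRST):
    FOLLOW = {A: set() for A in NT}
    FOLLOW["E"].add("$")  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in NT:
                        continue
                    f = first_of(rhs[i + 1:], FIRST)   # rules 2 and 3
                    add = (f - {"e"}) | (FOLLOW[A] if "e" in f else set())
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST["E"], FOLLOW["F"])
```

Running it reproduces the sets listed above, e.g. FIRST(E) = {(, id} and FOLLOW(F) = {+, *, ), $}.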
3.5 CONSTRUCTION OF PREDICTIVE PARSING TABLES
For any grammar G, the following algorithm can be used to construct the predictive parsing
table. The algorithm is:
Input: Grammar G
Output: Parsing table M
Method
1. For each production A -> α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A -> α to M[A, a].
3. If e is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A). If e is in
FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A, $].
4. Make each undefined entry of M be error.
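Steps 1-4 can be sketched directly. Here the FIRST and FOLLOW sets for the expression grammar of the previous section are hardcoded so the fragment stands alone ('e' is the empty string); missing entries of M play the role of error entries:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["e"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["e"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NT = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", "e"}, "T": {"(", "id"},
         "T'": {"*", "e"}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of(seq):
    out = set()
    for X in seq:
        f = FIRST[X] if X in NT else {X}
        out |= f - {"e"}
        if "e" not in f:
            return out
    out.add("e")
    return out

def build_table():
    M = {}
    for A, prods in GRAMMAR.items():      # step 1
        for rhs in prods:
            f = first_of(rhs)
            for a in f - {"e"}:           # step 2
                M[(A, a)] = rhs
            if "e" in f:                  # step 3 ($ is already in FOLLOW)
                for b in FOLLOW[A]:
                    M[(A, b)] = rhs
    return M

M = build_table()
```

For this grammar no entry is ever assigned twice, which is exactly the LL(1) property discussed in the next section.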
3.6 LL(1) GRAMMARS
The above algorithm can be applied to any grammar G to produce a parsing table M. For
some grammars, for example if G is left-recursive or ambiguous, M will have at least
one multiply-defined entry. A grammar whose parsing table has no multiply-defined entries
is said to be LL(1). It can be shown that the above algorithm can be used to produce for every
LL(1) grammar G a parsing table M that parses all and only the sentences of G. LL(1)
grammars have several distinctive properties. No ambiguous or left-recursive grammar can
be LL(1). There remains the question of what should be done in the case of multiply-defined
entries. One easy solution is to eliminate all left recursion and to left-factor, hoping to
produce a grammar for which the parsing table has no multiply-defined entries.
Unfortunately, there are some grammars for which no amount of alteration will yield an
LL(1) grammar. In general, there are no universal rules to convert multiply-defined entries
into single-valued entries without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left-
recursion elimination and left factoring are easy to do, they make the resulting grammar hard
to read and difficult to use for translation purposes. To alleviate some of this difficulty, a
common organization for a parser in a compiler is to use a predictive parser for control
constructs and to use operator precedence for expressions. However, if an LR parser
generator is available, one can get all the benefits of predictive parsing and operator
precedence automatically.
3.7.ERROR RECOVERY IN PREDICTIVE PARSING
The stack of a nonrecursive predictive parser makes explicit the terminals and nonterminals
that the parser hopes to match with the remainder of the input. We shall therefore refer to
symbols on the parser stack in the following discussion. An error is detected during
predictive parsing when the terminal on top of the stack does not match the next input
symbol or when nonterminal A is on top of the stack, a is the next input symbol, and the
parsing table entry M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input until a token
in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of
synchronizing set. The sets should be chosen so that the parser recovers quickly from errors
that are likely to occur in practice. Some heuristics are as follows
As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing
set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and
pop A from the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements may
not appear in the FOLLOW set of the nonterminal generating expressions. A missing
semicolon after an assignment may therefore result in the keyword beginning the next
statement being skipped. Often, there is a hierarchical structure on constructs in a
language; e.g., expressions appear within statements, which appear within blocks, and
so on. We can add to the synchronizing set of a lower-level construct the symbols that
begin higher-level constructs. For example, we might add keywords that begin statements
to the synchronizing sets for the nonterminals generating expressions.
If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST(A) appears in
the input.
If a nonterminal can generate the empty string, then the production deriving e can be
used as a default. Doing so may postpone some error detection, but cannot cause an
error to be missed. This approach reduces the number of nonterminals that have to be
considered during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue parsing.
In effect, this approach takes the synchronizing set of a token to consist of all other
tokens.
UNIT 4
LR PARSER
4.1 LR PARSING INTRODUCTION
The "L" is for left-to-right scanning of the input and the "R" is for constructing a rightmost
derivation in reverse.
4.2 WHY LR PARSING:
LR parsers can be constructed to recognize virtually all programming-language
constructs for which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
The class of grammars that can be parsed using LR methods is a proper superset of the
class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-
right scan of the input.
The disadvantage is that it takes too much work to construct an LR parser by hand for a
typical programming-language grammar. But there are lots of LR parser generators available
to make this task easy.
4.3.MODELS OF LR PARSERS
The schematic form of an LR parser is shown below.
The program uses a stack to store a string of the form s0X1s1X2...Xmsm where sm is on top.
Each Xi is a grammar symbol and each si is a symbol representing a state. Each state symbol
summarizes the information contained in the stack below it. The combination of the state
symbol on top of the stack and the current input symbol is used to index the parsing table
and determine the shift-reduce parsing decision. The parsing table consists of two parts: a
parsing action function action and a goto function goto. The program driving the LR parser
behaves as follows: it determines sm, the state currently on top of the stack, and ai, the current
input symbol. It then consults action[sm, ai], which can have one of four values:
shift s, where s is a state
reduce by a grammar production A -> b
accept
error
The function goto takes a state and grammar symbol as arguments and produces a state.
For a parsing table constructed for a grammar G, the goto table is the transition function of a
deterministic finite automaton that recognizes the viable prefixes of G. Recall that the viable
prefixes of G are those prefixes of right-sentential forms that can appear on the stack of a
shiftreduce parser because they do not extend past the rightmost handle.
A configuration of an LR parser is a pair whose first component is the stack contents and
whose second component is the unexpended input:
(s0 X1 s1 X2 s2... Xm sm, ai ai+1... an$)
This configuration represents the right-sentential form
X1 X1 ... Xm ai ai+1 ...an
in essentially the same way a shift-reduce parser would; only the presence of the states on the
stack is new. Recall the sample parse we did (see Example 1: Sample bottom-up parse) in
which we assembled the right-sentential form by concatenating the remainder of the input
buffer to the top of the stack. The next move of the parser is determined by reading ai and
sm, and consulting the parsing action table entry action[sm, ai]. Note that we are just looking
at the state here and no symbol below it. We'll see how this actually works later.
The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move entering the configuration
(s0 X1 s1 X2 s2... Xm sm ai s, ai+1... an$)
Here the parser has shifted both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack.
If action[sm, ai] = reduce A -> b, then the parser executes a reduce move, entering the
configuration,
(s0 X1 s1 X2 s2... Xm-r sm-r A s, ai ai+1... an$)
where s = goto[sm-r, A] and r is the length of b, the right side of the production. The parser
first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state
sm-r. The parser then pushed both A, the left side of the production, and s, the entry for
goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.
The output of an LR parser is generated after a reduce move by executing the semantic action
associated with the reducing production. For example, we might just print out the production
reduced.
If action[sm, ai] = accept, parsing is completed.
4.4.OPERATOR PRECEDENCE PARSING
Precedence Relations
Bottom-up parsers for a large class of context-free grammars can be easily developed
using operator grammars. Operator grammars have the property that no production right side
is empty or has two adjacent nonterminals. This property enables the implementation of
efficient operator-precedence parsers. These parsers rely on the following three precedence
relations:
Relation Meaning
a <· b a yields precedence to b
a =· b a has the same precedence as b
a ·> b a takes precedence over b
These operator precedence relations allow us to delimit the handles in the right-sentential
forms: <· marks the left end, =· appears in the interior of the handle, and ·> marks the right
end.
Example: The input string:
id1 + id2 * id3
after inserting precedence relations becomes
$ <· id1 ·> + <· id2 ·> * <· id3 ·> $
Having precedence relations allows us to identify handles as follows:
scan the string from the left until seeing ·>
scan backwards, from right to left, until seeing <·
everything between the two relations <· and ·> forms the handle
4.5 OPERATOR PRECEDENCE PARSING ALGORITHM
Initialize: Set ip to point to the first symbol of w$
Repeat:
    if $ is on the top of the stack and ip points to $ then
        return
    else
        Let a be the top terminal on the stack, and b the symbol pointed to by ip
        if a <· b or a =· b then
            push b onto the stack
            advance ip to the next input symbol
        else if a ·> b then
            repeat
                pop the stack
            until the top stack terminal is related by <·
                  to the terminal most recently popped
        else error()
end
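The loop above can be sketched in code for the terminals id, +, * and $, with the precedence relations hardcoded from the usual precedence of + and *. As operator-precedence parsers do, the sketch keeps only terminals on the stack, so each reduced handle appears as its terminal content (names and return value are illustrative):

```python
# PREC[(a, b)] is '<', '=', or '>' for a <· b, a =· b, a ·> b respectively.
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}

def op_precedence_parse(tokens):
    """Run the loop above; return the sequence of handles reduced."""
    stack = ["$"]
    toks = list(tokens) + ["$"]
    i, handles = 0, []
    while not (stack == ["$"] and toks[i] == "$"):
        a, b = stack[-1], toks[i]      # top terminal and current input
        rel = PREC.get((a, b))
        if rel in ("<", "="):          # shift b
            stack.append(b)
            i += 1
        elif rel == ">":               # reduce: pop back to the matching <·
            handle = []
            while True:
                popped = stack.pop()
                handle.append(popped)
                if PREC.get((stack[-1], popped)) == "<":
                    break
            handles.append(handle[::-1])
        else:
            raise SyntaxError(f"no precedence relation between {a} and {b}")
    return handles
```

On id + id * id the three ids are reduced first, then the * handle, then the + handle, matching the precedence of * over +.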
4.6 ALGORITHM FOR CONSTRUCTING PRECEDENCE FUNCTIONS
1. Create functions fa and ga for each grammar terminal a and for the end-of-string symbol.
2. Partition the symbols into groups so that fa and gb are in the same group if a =· b (there
can be symbols in the same group even if they are not connected by this relation).
3. Create a directed graph whose nodes are the groups; next, for each pair of symbols a and b
do: place an edge from the group of gb to the group of fa if a <· b; otherwise, if a ·> b,
place an edge from the group of fa to that of gb.
4. If the constructed graph has a cycle, then no precedence functions exist. When there are
no cycles, collect the lengths of the longest paths from the groups of fa and gb. Example:
Consider the above table Using the algorithm leads to the following graph:
4.7 SHIFT REDUCE PARSING
A shift-reduce parser uses a parse stack which (conceptually) contains grammar symbols.
During the operation of the parser, symbols from the input are shifted onto the stack. If a
prefix of the symbols on top of the stack matches the RHS of a grammar rule which is the
correct rule to use within the current context, then the parser reduces the RHS of the rule to
its LHS, replacing the RHS symbols on top of the stack with the nonterminal occurring on the
LHS of the rule. This shift-reduce process continues until the parser terminates, reporting
either success or failure. It terminates with success when the input is legal and is accepted by
the parser. It terminates with failure if an error is detected in the input. The parser is nothing
but a stack automaton which may be in one of several discrete states. A state is usually
represented simply as an integer. In reality, the parse stack contains states, rather than
grammar symbols. However, since each state corresponds to a unique grammar symbol, the
state stack can be mapped onto the grammar symbol stack mentioned earlier.
The operation of the parser is controlled by a couple of tables:
4.8 ACTION TABLE
The action table is a table with rows indexed by states and columns indexed by terminal
symbols. When the parser is in some state s and the current lookahead terminal is t, the
action taken by the parser depends on the contents of action[s][t], which can contain four
different kinds of entries:
Shift s'
Shift state s' onto the parse stack.
Reduce r
Reduce by rule r. This is explained in more detail below.
Accept
Terminate the parse with success, accepting the input.
Error
Signal a parse error
4.9 GOTO TABLE
The goto table is a table with rows indexed by states and columns indexed by nonterminal
symbols. When the parser is in state s immediately after reducing by rule N, then the next
state to enter is given by goto[s][N].
The current state of a shift-reduce parser is the state on top of the state stack. The detailed
operation of such a parser is as follows:
1. Initialize the parse stack to contain a single state s0, where s0 is the distinguished initial
state of the parser.
2. Use the state s on top of the parse stack and the current lookahead t to consult the action
table entry action[s][t]:
· If the action table entry is shift s' then push state s' onto the stack and advance the
input so that the lookahead is set to the next token.
· If the action table entry is reduce r and rule r has m symbols in its RHS, then pop
m symbols off the parse stack. Let s' be the state now revealed on top of the parse
stack and N be the LHS nonterminal for rule r. Then consult the goto table and
push the state given by goto[s'][N] onto the stack. The lookahead token is not
changed by this step.
If the action table entry is accept, then terminate the parse with success.
If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
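The loop in steps 1-3 can be sketched as a small table-driven driver. The grammar, state numbers, and table entries below are hand-built for the tiny grammar S -> CC, C -> cC | d; they are illustrative assumptions, not the tables for the assignment-statement grammar that follows:

```python
# A sketch of the shift-reduce driver described in steps 1-3 above.
# ACTION and GOTO are hand-built for: 1) S -> C C  2) C -> c C  3) C -> d
ACTION = {
    (0, 'c'): ('shift', 36), (0, 'd'): ('shift', 47),
    (1, '$'): ('accept',),
    (2, 'c'): ('shift', 36), (2, 'd'): ('shift', 47),
    (36, 'c'): ('shift', 36), (36, 'd'): ('shift', 47),
    (47, 'c'): ('reduce', 3), (47, 'd'): ('reduce', 3), (47, '$'): ('reduce', 3),
    (5, '$'): ('reduce', 1),
    (89, 'c'): ('reduce', 2), (89, 'd'): ('reduce', 2), (89, '$'): ('reduce', 2),
}
GOTO = {(0, 'S'): 1, (0, 'C'): 2, (2, 'C'): 5, (36, 'C'): 89}
RULES = {1: ('S', 2), 2: ('C', 2), 3: ('C', 1)}   # rule -> (LHS, length of RHS)

def parse(tokens):
    stack = [0]                       # step 1: push the initial state s0
    toks = tokens + ['$']
    i = 0
    while True:                       # step 3: repeat step 2
        entry = ACTION.get((stack[-1], toks[i]))
        if entry is None:
            return False              # blank entry: signal an error
        if entry[0] == 'shift':       # push s' and advance the input
            stack.append(entry[1]); i += 1
        elif entry[0] == 'reduce':    # pop m symbols, then consult GOTO
            lhs, m = RULES[entry[1]]
            del stack[-m:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True               # accept

print(parse(list('cdd')))   # True
print(parse(list('cd')))    # False
```

Note that the parse stack holds only states, as described above; the grammar symbols are implicit in the states.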
For example, consider the following simple grammar
0) $S: stmt <EOF>
1) stmt: ID ':=' expr
2) expr: expr '+' ID
3) expr: expr '-' ID
4) expr: ID
which describes assignment statements like a:= b + c - d. (Rule 0 is a special augmenting
production added to the grammar).
One possible set of shift-reduce parsing tables is shown below (sn denotes shift n, rn denotes
reduce n, acc denotes accept and blank entries denote error entries):
Parser Tables
4.10 SLR PARSER
An LR(0) item (or just item) of a grammar G is a production of G with a dot at some position
of the right side indicating how much of a production we have seen up to a given point.
For example, for the production E -> E + T we would have the following items:
[E -> .E + T]
[E -> E. + T]
[E -> E +. T]
[E -> E + T.]
4.11 CONSTRUCTING THE SLR PARSING TABLE
To construct the parser table we must convert our NFA into a DFA. The states in the LR
table will be the e-closures of the states corresponding to the items SO...the process of
creating the LR state table parallels the process of constructing an equivalent DFA from a
machine with e-transitions. Been there, done that - this is essentially the subset construction
algorithm so we are in familiar territory here.
We need two operations: closure()
and goto().
closure()
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by
the two rules: Initially every item in I is added to closure(I)
If A -> a.Bb is in closure(I), and B -> g is a production, then add the initial item [B -> .g] to
closure(I), if it is not already there. Apply this rule until no more new items can be added to closure(I).
From our grammar above, if I is the set of one item {[E'-> .E]}, then closure(I) contains:
I0: E' -> .E
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto()
goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure
of the set of all items [A -> aX.b] such that [A -> a.Xb] is in I. The idea here is fairly intuitive:
if I is the set of items that are valid for some viable prefix g, then goto(I, X) is the set of items
that are valid for the viable prefix gX.
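The two operations can be sketched directly from the definitions above. Items are (lhs, rhs, dot) triples, and the grammar is the expression grammar used in the closure() example; the encoding is an illustrative assumption:

```python
# closure() and goto() for LR(0) items, for the grammar
# E' -> E, E -> E+T | T, T -> T*F | F, F -> (E) | id.
GRAMMAR = {
    "E'": [('E',)],
    'E': [('E', '+', 'T'), ('T',)],
    'T': [('T', '*', 'F'), ('F',)],
    'F': [('(', 'E', ')'), ('id',)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:                                   # apply the rule until fixpoint
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # A -> a.Bb, B a nonterminal
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)           # add [B -> .g]
                    if item not in items:
                        items.add(item); changed = True
    return items

def goto(items, X):
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}        # advance the dot over X
    return closure(moved)

I0 = closure({("E'", ('E',), 0)})
print(len(I0))   # 7: the seven items of I0 listed above
```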
4.12 SETS-OF-ITEMS-CONSTRUCTION
To construct the canonical collection of sets of LR(0) items for
augmented grammar G'.
procedure items(G')
begin
C := {closure({[S' -> .S]})};
repeat
for each set of items I in C and each grammar symbol X
such that goto(I, X) is not empty and not in C do
add goto(I, X) to C;
until no more sets of items can be added to C
end;
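The items(G') procedure above can be sketched as a worklist loop. The grammar below (S' -> S, S -> CC, C -> cC | d, LR(0) items without lookaheads) is an illustrative choice, and closure/goto are inlined so the sketch is self-contained:

```python
# Canonical collection of sets of LR(0) items, following items(G') above.
GRAMMAR = {"S'": [('S',)], 'S': [('C', 'C')], 'C': [('c', 'C'), ('d',)]}
SYMBOLS = ['S', 'C', 'c', 'd']

def closure(items):
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                it = (rhs[dot], prod, 0)
                if it not in items:
                    items.add(it); work.append(it)
    return frozenset(items)

def goto(I, X):
    return closure({(l, r, d + 1) for (l, r, d) in I
                    if d < len(r) and r[d] == X})

def items():
    C = {closure({("S'", ('S',), 0)})}   # C := {closure({[S' -> .S]})}
    changed = True
    while changed:                       # until no more sets can be added
        changed = False
        for I in list(C):
            for X in SYMBOLS:
                J = goto(I, X)
                if J and J not in C:
                    C.add(J); changed = True
    return C

print(len(items()))   # 7 sets of items for this grammar
```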
4.13 ALGORITHM FOR CONSTRUCTING AN SLR PARSING TABLE
Input: augmented grammar G'
Output: SLR parsing table functions action and goto for G'
Method:
Construct C = {I0, I1 , ..., In} the collection of sets of LR(0) items for G'.
State i is constructed from Ii:
if [A -> a.ab] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a
terminal.
if [A -> a.] is in Ii, then set action[i, a] to "reduce A -> a" for all a in FOLLOW(A). Here A
may not be S'.
if [S' -> S.] is in Ii, then set action[i, $] to "accept"
If any conflicting actions are generated by these rules, the grammar is not SLR(1) and the
algorithm fails to produce a parser. The goto transitions for state i are constructed for all
nonterminals A using the rule: If goto(Ii, A)= Ij, then goto[i, A] = j.
All entries not defined by rules 2 and 3 are made "error".
The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
Let's work an example to get a feel for what is going on,
An Example
(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1
The Action and Goto Tables: the two LR(0) parsing tables for this grammar look as follows:
UNIT -5
5.1 CANONICAL LR PARSING
By splitting states when necessary, we can arrange to have each state of an LR parser
indicate exactly which input symbols can follow a handle a for which there is a possible
reduction to A. As the text points out, sometimes the FOLLOW sets give too much
information and cannot discriminate between different reductions.
The general form of an LR(k) item becomes [A -> a.b, s] where A -> ab is a production and s
is a string of terminals. The first part (A -> a.b) is called the core and the second part is the
lookahead. In LR(1) |s| is 1, so s is a single terminal.
A -> ab is the usual righthand side with a marker; any a in s is an incoming token in which
we are interested. Completed items used to be reduced for every incoming token in
FOLLOW(A), but now we will reduce only if the next input token is in the lookahead set s. If
we get two productions A -> a and B -> a, we can tell them apart when a is a handle on the
stack if the corresponding completed items have different lookahead parts. Furthermore, note
that the lookahead has no effect for an item of the form [A -> a.b, a] if b is not e. Recall that
our problem occurs for completed items, so what we have done now is to say that an item of
the form [A -> a., a] calls for a reduction by A -> a only if the next input symbol is a. More
formally, an LR(1) item [A -> a.b, a] is valid for a viable prefix g if there is a derivation
S =>* s abw, where g = sa, and either a is the first symbol of w, or w is e and a is $.
5.2 ALGORITHM FOR CONSTRUCTION OF THE SETS OF LR(1) ITEMS
Input: grammar G'
Output: sets of LR(1) items that are the set of items valid for one or more viable prefixes of
G'
Method:
closure(I)
begin
repeat
for each item [A -> a.Bb, a] in I,
each production B -> g in G',
and each terminal b in FIRST(ba)
such that [B -> .g, b] is not in I do
add [B -> .g, b] to I;
until no more items can be added to I;
end;
5.3 goto(I, X)
begin
let J be the set of items [A -> aX.b, a] such that
[A -> a.Xb, a] is in I
return closure(J);
end;
procedure items(G')
begin
C := {closure({[S' -> .S, $]})};
repeat
for each set of items I in C and each grammar symbol X such
that goto(I, X) is not empty and not in C do
add goto(I, X) to C
until no more sets of items can be added to C;
end;
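The LR(1) closure() above can be sketched for the grammar of the example that follows (S' -> S, S -> CC, C -> cC | d). Items carry a lookahead component, and FIRST is hardcoded for this tiny grammar, which is an assumption made for brevity:

```python
# LR(1) closure(): items are (lhs, rhs, dot, lookahead) tuples.
GRAMMAR = {"S'": [('S',)], 'S': [('C', 'C')], 'C': [('c', 'C'), ('d',)]}
NONTERM = set(GRAMMAR)

def first_seq(symbols):
    # FIRST of a sequence; no nullable symbols in this grammar, and every
    # nonterminal here derives strings starting with c or d.
    s = symbols[0]
    return {'c', 'd'} if s in NONTERM else {s}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERM:
                for b in first_seq(rhs[dot + 1:] + (la,)):   # b in FIRST(ba)
                    for prod in GRAMMAR[rhs[dot]]:
                        it = (rhs[dot], prod, 0, b)          # add [B -> .g, b]
                        if it not in items:
                            items.add(it); changed = True
    return items

I0 = closure({("S'", ('S',), 0, '$')})
print(len(I0))   # 6 items: the set I0 listed below, with c/d lookaheads split
```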
An example,
Consider the following grammar,
S’->S
S->CC
C->cC
C->d
Sets of LR(1) items
I0: S’->.S,$
S->.CC,$
C->.cC,c/d
C->.d,c/d
I1:S’->S.,$
I2:S->C.C,$
C->.cC,$
C->.d,$
I3:C->c.C,c/d
C->.cC,c/d
C->.d,c/d
I4: C->d.,c/d
I5: S->CC.,$
I6: C->c.C,$
C->.cC,$
C->.d,$
I7:C->d.,$
I8:C->cC.,c/d
I9:C->cC.,$
Here is what the corresponding DFA looks like
5.4 ALGORITHM FOR CONSTRUCTION OF THE CANONICAL LR PARSING
TABLE
Input: grammar G'
Output: canonical LR parsing table functions action and goto
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'. State i is
constructed from Ii.
2. if [A -> a.ab, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a
must be a terminal.
3. if [A -> a., a] is in Ii, then set action[i, a] to "reduce A -> a". Here A may not be S'.
(Unlike SLR, the reduce entry is made only for the lookahead a, not for every symbol
in FOLLOW(A).)
4. if [S' -> S., $] is in Ii, then set action[i, $] to "accept"
5. If any conflicting actions are generated by these rules, the grammar is not LR(1)
and the algorithm fails to produce a parser.
6. The goto transitions for state i are constructed for all nonterminals A using the
rule: If goto(Ii, A)= Ij, then goto[i, A] = j.
7. All entries not defined by rules 2 and 3 are made "error".
8. The initial state of the parser is the one constructed from the set of items
containing [S' -> .S, $].
5.5 LALR PARSER
We begin with two observations. First, some of the states generated for LR(1) parsing have
the same set of core (or first) components and differ only in their second component, the
lookahead symbol. Our intuition is that we should be able to merge these states and reduce
the number of states we have, getting close to the number of states that would be generated
for LR(0) parsing. This observation suggests a hybrid approach: We can construct the
canonical LR(1) sets of items and then look for sets of items having the same core. We merge
these sets with common cores into one set of items. The merging of states with common
cores can never produce a shift/reduce conflict that was not present in one of the original
states because shift actions depend only on the core, not the lookahead. But it is possible for
the merger to produce a reduce/reduce conflict.
Our second observation is that we are really only interested in the lookahead symbol in
places where there is a problem. So our next thought is to take the LR(0) set of items and add
lookaheads only where they are needed. This leads to a more efficient, but much more
complicated method.
5.6 ALGORITHM FOR EASY CONSTRUCTION OF AN LALR TABLE
Input: G'
Output: LALR parsing table functions with action and goto for G'.
Method:
1. Construct C = {I0, I1 , ..., In} the collection of sets of LR(1) items for G'.
2. For each core present among the set of LR(1) items, find all sets having that core
and replace these sets by the union.
3. Let C' = {J0, J1 , ..., Jm} be the resulting sets of LR(1) items. The parsing actions
for state i are constructed from Ji in the same manner as in the construction of the
canonical LR parsing table.
4. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.
5. The goto table is constructed as follows: If J is the union of one or more sets of
LR(1) items, that is, J = I0U I1 U ... U Ik, then the cores of goto(I0, X), goto(I1,
X), ..., goto(Ik, X) are the same, since I0, I1 , ..., Ik all have the same core. Let K
be the union of all sets of items having the same core as goto(I1, X).
6. Then goto(J, X) = K.
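Step 2 of this construction can be sketched directly: group the LR(1) item sets by core and take unions. Items are (lhs, rhs, dot, lookahead) tuples, and I3 and I6 are transcribed from the sets of LR(1) items in section 5.3:

```python
# Merging LR(1) item sets with a common core (step 2 above).
def merge_by_core(sets_of_items):
    merged = {}
    for I in sets_of_items:
        core = frozenset((l, r, d) for (l, r, d, a) in I)   # drop lookaheads
        merged.setdefault(core, set()).update(I)            # union lookaheads
    return list(merged.values())

I3 = {('C', ('c', 'C'), 1, 'c'), ('C', ('c', 'C'), 1, 'd'),
      ('C', ('c', 'C'), 0, 'c'), ('C', ('c', 'C'), 0, 'd'),
      ('C', ('d',), 0, 'c'), ('C', ('d',), 0, 'd')}
I6 = {('C', ('c', 'C'), 1, '$'), ('C', ('c', 'C'), 0, '$'),
      ('C', ('d',), 0, '$')}

print(len(merge_by_core([I3, I6])))   # 1: they share a core, merging into I36
```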
Consider the above example,
I3 & I6 can be replaced by their union
I36:C->c.C,c/d/$
C->.cC,c/d/$
C->.d,c/d/$
I47:C->d.,c/d/$
I89:C->cC.,c/d/$
Parsing Table

state   c     d     $        S    C
0       s36   s47            1    2
1                   accept
2       s36   s47                 5
36      s36   s47                 89
47      r3    r3    r3
5                   r1
89      r2    r2    r2
5.7 HANDLING ERRORS
The LALR parser may continue to do reductions after the LR parser would have spotted an
error, but the LALR parser will never do a shift after the point the LR parser would have
discovered the error and will eventually find the error.
5.8 DANGLING ELSE
The dangling else is a problem in computer programming in which an optional else clause in
an if-then(-else) statement makes nested conditionals ambiguous. Formally,
the context-free grammar of the language is ambiguous, meaning there is more than one
correct parse tree.
In many programming languages one may write conditionally executed code in two forms:
the if-then form and the if-then-else form, in which the else clause is optional:
Consider the grammar:
S ::= E $
E ::= E + E
| E * E
| ( E )
| id
| num
and four of its LALR(1) states:
I0: S ::= . E $      ?
    E ::= . E + E    +*$
    E ::= . E * E    +*$
    E ::= . ( E )    +*$
    E ::= . id       +*$
    E ::= . num      +*$

I1: S ::= E . $      ?
    E ::= E . + E    +*$
    E ::= E . * E    +*$

I2: E ::= E * . E    +*$
    E ::= . E + E    +*$
    E ::= . E * E    +*$
    E ::= . ( E )    +*$
    E ::= . id       +*$
    E ::= . num      +*$

I3: E ::= E * E .    +*$
    E ::= E . + E    +*$
    E ::= E . * E    +*$
Here we have a shift-reduce conflict. Consider the first two items in I3. If we have a*b+c and
we parsed a*b, do we reduce using E ::= E * E or do we shift more symbols? In the former
case we get a parse tree (a*b)+c; in the latter case we get a*(b+c). To resolve this conflict, we
can specify that * has higher precedence than +. The precedence of a grammar production is
equal to the precedence of the rightmost token at the rhs of the production. For example, the
precedence of the production E ::= E * E is equal to the precedence of the operator *, the
precedence of the production E ::= ( E ) is equal to the precedence of the token ), and the
precedence of the production E ::= if E then E else E is equal to the precedence of the token
else. The idea is that if the look ahead has higher precedence than the production currently
used, we shift. For example, if we are parsing E + E using the production rule E ::= E + E
and the look ahead is *, we shift *. If the look ahead has the same precedence as that of the
current production and is left associative, we reduce, otherwise we shift. The above grammar
is valid if we define the precedence and associativity of all the operators. Thus, it is very
important when you write a parser using CUP or any other LALR(1) parser generator to
specify associativities and precedences for most tokens (especially for those used as
operators). Note: you can explicitly define the precedence of a rule in CUP using the %prec
directive:
E ::= MINUS E %prec UMINUS
where UMINUS is a pseudo-token that has higher precedence than TIMES, MINUS etc, so
that -1*2 is equal to (-1)*2, not to -(1*2).
Another thing we can do when specifying an LALR(1) grammar for a parser generator is
error recovery. All the entries in the ACTION and GOTO tables that have no content
correspond to syntax errors. The simplest thing to do in case of error is to report it and stop
the parsing. But we would like to continue parsing finding more errors. This is called error
recovery. Consider the grammar:
S ::= L = E ;
| { SL } ;
| error ;
SL ::= S ;
| SL S ;
The special token error indicates to the parser what to do in case of invalid syntax for S (an
invalid statement). In this case, it reads all the tokens from the input stream until it finds the
first semicolon. The way the parser handles this is to first push an error state in the stack. In
case of an error, the parser pops out elements from the stack until it finds an error state where
it can proceed. Then it discards tokens from the input until a restart is possible. Inserting
error handling productions in the proper places in a grammar to do good error recovery is
considered very hard.
5.9 LR ERROR RECOVERY
An LR parser will detect an error when it consults the parsing action table and finds a blank or
error entry. Errors are never detected by consulting the goto table. An LR parser will detect
an error as soon as there is no valid continuation for the portion of the input thus far scanned.
A canonical LR parser will not make even a single reduction before announcing the error.
SLR and LALR parsers may make several reductions before detecting an error, but they will
never shift an erroneous input symbol onto the stack.
5.10 PANIC-MODE ERROR RECOVERY
We can implement panic-mode error recovery by scanning down the stack until a state s with
a goto on a particular nonterminal A is found. Zero or more input symbols are then discarded
until a symbol a is found that can legitimately follow A. The parser then stacks the state
GOTO(s, A) and resumes normal parsing. The situation might exist where there is more than
one choice for the nonterminal A. Normally these would be nonterminals representing major
program pieces, e.g. an expression, a statement, or a block. For example, if A is the
nonterminal stmt, a might be semicolon or }, which marks the end of a statement sequence.
This method of error recovery attempts to eliminate the phrase containing the syntactic error.
The parser determines that a string derivable from A contains an error. Part of that string has
already been processed, and the result of this processing is a sequence of states on top of the
stack. The remainder of the string is still in the input, and the parser attempts to skip over the
remainder of this string by looking for a symbol on the input that can legitimately follow A.
By removing states from the stack, skipping over the input, and pushing GOTO(s, A) on the
stack, the parser pretends that it has found an instance of A and resumes normal parsing.
5.11 PHRASE-LEVEL RECOVERY
Phrase-level recovery is implemented by examining each error entry in the LR action table
and deciding on the basis of language usage the most likely programmer error that would
give rise to that error. An appropriate recovery procedure can then be constructed;
presumably the top of the stack and/or first input symbol would be modified in a way deemed
appropriate for each error entry. In designing specific error-handling routines for an LR
parser, we can fill in each blank entry in the action field with a pointer to an error routine that
will take the appropriate action selected by the compiler designer.
The actions may include insertion or deletion of symbols from the stack or the input or both,
or alteration and transposition of input symbols. We must make our choices so that the LR
parser will not get into an infinite loop. A safe strategy will assure that at least one input
symbol will be removed or shifted eventually, or that the stack will eventually shrink if the
end of the input has been reached. Popping a stack state that covers a non terminal should be
avoided, because this modification eliminates from the stack a construct that has already been
successfully parsed.
UNIT 6
SEMANTIC ANALYSIS
6.1 SEMANTIC ANALYSIS
Semantic Analysis computes additional information related to the meaning of the
program once the syntactic structure is known.
In typed languages such as C, semantic analysis involves adding information to the symbol
table and performing type checking.
The information to be computed is beyond the capabilities of standard parsing
techniques, therefore it is not regarded as syntax.
As with Lexical and Syntax analysis, for Semantic Analysis we need both a
Representation Formalism and an Implementation Mechanism.
As a representation formalism, this lecture illustrates what are called Syntax Directed
Translations.
6.2 SYNTAX DIRECTED TRANSLATION
The Principle of Syntax Directed Translation states that the meaning of an input
sentence is related to its syntactic structure, i.e., to its Parse-Tree.
By Syntax Directed Translations we indicate those formalisms for specifying
translations for programming language constructs guided by context-free grammars.
o We associate Attributes to the grammar symbols representing the language
constructs.
o Values for attributes are computed by Semantic Rules associated with
grammar productions.
Evaluation of Semantic Rules may:
o Generate Code;
o Insert information into the Symbol Table;
o Perform Semantic Check;
o Issue error messages;
o etc.
There are two notations for attaching semantic rules:
1. Syntax Directed Definitions. High-level specification hiding many implementation
details (also called Attribute Grammars).
2. Translation Schemes. More implementation oriented: Indicate the order in which
semantic rules are to be evaluated.
Syntax Directed Definitions
Syntax Directed Definitions are a generalization of context-free grammars in which:
1. Grammar symbols have an associated set of Attributes;
2. Productions are associated with Semantic Rules for computing the values of attributes.
Such a formalism generates Annotated Parse-Trees, where each node of the tree is a
record with a field for each attribute (e.g.,X.a indicates the attribute a of the grammar
symbol X).
The value of an attribute of a grammar symbol at a given parse-tree node is defined by
a semantic rule associated with the production used at that node.
We distinguish between two kinds of attributes:
1. Synthesized Attributes. They are computed from the values of the attributes of the
children nodes.
2. Inherited Attributes. They are computed from the values of the attributes of both the
siblings and the parent nodes
Syntax Directed Definitions: An Example
Example. Let us consider the Grammar for arithmetic expressions. The
Syntax Directed Definition associates to each non terminal a synthesized
attribute called val.
6.3 S-ATTRIBUTED DEFINITIONS
Definition. An S-Attributed Definition is a Syntax Directed Definition that uses
only synthesized attributes.
Evaluation Order. Semantic rules in an S-Attributed Definition can be
evaluated by a bottom-up, or PostOrder, traversal of the parse-tree.
Example. The above arithmetic grammar is an example of an S-Attributed
Definition. The annotated parse-tree for the input 3*5+4n is:
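The annotated tree itself is not reproduced here, but the bottom-up evaluation it describes can be sketched directly: each node computes its val from its children's val, exactly as a PostOrder traversal would. The tuple encoding of the parse of 3*5+4 below is an illustrative assumption:

```python
# Bottom-up evaluation of the synthesized attribute val, for the grammar
# L -> E n, E -> E + T | T, T -> T * F | F, F -> digit.
def val(node):
    op, *kids = node
    if op == 'digit':
        return kids[0]                          # F.val = digit.lexval
    if op == '+':
        return val(kids[0]) + val(kids[1])      # E.val = E1.val + T.val
    if op == '*':
        return val(kids[0]) * val(kids[1])      # T.val = T1.val * F.val

tree = ('+', ('*', ('digit', 3), ('digit', 5)), ('digit', 4))
print(val(tree))   # 19, the value annotated at the root for 3*5+4n
```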
6.4 L-ATTRIBUTED DEFINITION
Definition: An SDD is L-attributed if each inherited attribute of Xi in the RHS of
A -> X1 ... Xn depends only on
1. attributes of X1, X2, ..., X(i-1) (the symbols to the left of Xi in the RHS)
2. inherited attributes of A.
Restrictions for translation schemes:
1. Inherited attribute of Xi must be computed by an action before Xi.
2. An action must not refer to synthesized attribute of any symbol to the right of that action.
3. Synthesized attribute for A can only be computed after all attributes it references have
been completed (usually at end of RHS).
6.5 SYMBOL TABLES
A symbol table is a major data structure used in a compiler. Associates attributes with
identifiers used in a program. For instance, a type attribute is usually associated with each
identifier. A symbol table is a necessary component Definition (declaration) of identifiers
appears once in a program .Use of identifiers may appear in many places of the program text
Identifiers and attributes are entered by the analysis phases. When processing a definition
(declaration) of an identifier. In simple languages with only global variables and implicit
declarations. The scanner can enter an identifier into a symbol table if it is not already there
In block-structured languages with scopes and explicit declarations:
The parser and/or semantic analyzer enter identifiers and corresponding attributes
Symbol table information is used by the analysis and synthesis phases
To verify that used identifiers have been defined (declared)
To verify that expressions and assignments are semantically correct type checking
To generate intermediate or target code
Symbol Table Interface
The basic operations defined on a symbol table include:
allocate to allocate a new empty symbol table
free to remove all entries and free the storage of a symbol table
insert to insert a name in a symbol table and return a pointer to its entry
lookup to search for a name and return a pointer to its entry
set_attribute to associate an attribute with a given entry
get_attribute to get an attribute associated with a given entry
Other operations can be added depending on requirements. For example, a delete operation
removes a name previously inserted; some identifiers become invisible (out of scope) after
exiting a block.
This interface provides an abstract view of a symbol table
Supports the simultaneous existence of multiple tables
Implementation can vary without modifying the interface
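The interface above can be sketched as a small class backed by a Python dict (itself a hash table). The names follow the operations listed; the entry representation is an illustrative assumption:

```python
# A minimal symbol table implementing the interface above.
class SymbolTable:
    def __init__(self):                 # allocate: a new empty table
        self.entries = {}

    def free(self):                     # remove all entries
        self.entries.clear()

    def insert(self, name):             # insert name, return its entry
        return self.entries.setdefault(name, {})

    def lookup(self, name):             # return the entry, or None
        return self.entries.get(name)

    def set_attribute(self, entry, attr, value):
        entry[attr] = value

    def get_attribute(self, entry, attr):
        return entry.get(attr)

syms = SymbolTable()
e = syms.insert('count')
syms.set_attribute(e, 'type', 'int')
print(syms.get_attribute(syms.lookup('count'), 'type'))   # int
```

Because each table is an object, multiple tables can exist simultaneously (e.g. one per scope), and the backing structure can change without modifying the interface.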
Basic Implementation Techniques
First consideration is how to insert and lookup names
Variety of implementation techniques
Unordered List
Simplest to implement
Implemented as an array or a linked list
A linked list can grow dynamically, which alleviates the problem of a fixed-size array
Insertion is fast, O(1), but lookup is slow for large tables: O(n) on average
Ordered List
If an array is sorted, it can be searched using binary search: O(log2 n)
Insertion into a sorted array is expensive: O(n) on average
Useful when the set of names is known in advance, e.g. a table of reserved words
Binary Search Tree
Can grow dynamically
Insertion and lookup are O(log2 n) on average
6.6 HASH TABLES AND HASH FUNCTIONS
A hash table is an array with index range 0 to TableSize - 1
Most commonly used data structure to implement symbol tables
Insertion and lookup can be made very fast O(1)
A hash function maps an identifier name into a table index
A hash function, h(name), should depend solely on name
h(name) should be computed quickly
h should be uniform and randomizing in distributing names
All table indices should be mapped with equal probability.
Similar names should not cluster to the same table index
6.7 HASH FUNCTIONS
Hash functions can be defined in many ways:
A string can be treated as a sequence of integer words
Several characters are fit into an integer word
Strings longer than one word are folded using exclusive-or or addition
The hash value is obtained by taking the integer word modulo TableSize
We can also compute a hash value character by character:
h(name) = (c0 + c1 + ... + c(n-1)) mod TableSize, where n is the length of name
h(name) = (c0 * c1 * ... * c(n-1)) mod TableSize
h(name) = (c(n-1) + k*(c(n-2) + ... + k*(c1 + k*c0))) mod TableSize, for some multiplier k
h(name) = (c0 * c(n-1) * n) mod TableSize
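The additive scheme and the multiplier (Horner-style) scheme can be sketched in Python. TABLE_SIZE and the multiplier are illustrative choices, not values from the text:

```python
# Character-by-character hash functions, as described above.
TABLE_SIZE = 211

def hash_add(name):
    # h(name) = (c0 + c1 + ... + c(n-1)) mod TableSize
    return sum(ord(c) for c in name) % TABLE_SIZE

def hash_horner(name, k=65599):
    # h(name) = (c(n-1) + k*(c(n-2) + ... + k*c0)) mod TableSize,
    # evaluated left to right by Horner's rule: h = h*k + c_i
    h = 0
    for c in name:
        h = (h * k + ord(c)) % TABLE_SIZE
    return h

print(hash_add('count'))     # 131
print(hash_horner('count'))
```

The Horner scheme is preferred in practice because it depends on character order, so similar names such as 'ab' and 'ba' do not cluster to the same index.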
6.8 RUNTIME ENVIRONMENT
Runtime organization of different storage locations
Representation of scopes and extents during program execution.
Components of executing program reside in blocks of memory (supplied by OS).
Three kinds of entities that need to be managed at runtime:
o Generated code for various procedures and programs.
forms text or code segment of your program: size known at compile time.
o Data objects:
Global variables/constants: size known at compile time
Variables declared within procedures/blocks: size known
Variables created dynamically: size unknown.
o Stack to keep track of procedure activations.
Subdivide memory conceptually into code and data areas:
Code: Program
instructions
Stack: Manage activation of procedures at runtime.
Heap: holds variables created dynamically
6.9 STORAGE ORGANIZATION
1. Fixed-size objects can be placed in predefined locations.
2. Run-time stack and heap
The STACK is used to store:
o Procedure activations.
o The status of the machine just before calling a procedure, so that the status can be
restored when the called procedure returns.
The HEAP stores data allocated under program control (e.g. by malloc() in C).
Activation records
Any information needed for a single activation of a procedure is stored in the
ACTIVATION RECORD (sometimes called the STACK FRAME). Today, we’ll assume the
stack grows DOWNWARD, as on, e.g., the Intel architecture. The activation record gets
pushed for each procedure call and popped for each procedure return.
6.9 STATIC ALLOCATION
Statically allocated names are bound to storage at compile time. Storage bindings of
statically allocated names never change, so even if a name is local to a procedure, its name is
always bound to the same storage. The compiler uses the type of a name (retrieved from the
symbol table) to determine storage size required. The required number of bytes (possibly
aligned) is set aside for the name. The address of the storage is fixed at compile time.
Limitations:
The size required must be known at compile time.
Recursive procedures cannot be implemented as all locals are statically
allocated.
No data structure can be created dynamically as all data is static.
Stack-dynamic allocation
Storage is organized as a stack.
Activation records are pushed and popped.
Locals and parameters are contained in the activation records for the call.
This means locals are bound to fresh storage on every call.
If we have a stack growing downwards, we just need a stack_top pointer.
To allocate a new activation record, we just increase stack_top.
To deallocate an existing activation record, we just decrease stack_top.
Address generation in stack allocation
The position of the activation record on the stack cannot be determined statically.
Therefore the compiler must generate addresses RELATIVE to the activation record. If we
have a downward-growing stack and a stack_top pointer, we generate addresses of the form
stack_top + offset
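This scheme can be modeled with a toy stack. Here the stack grows toward lower addresses, so pushing an activation record moves stack_top down, and locals are addressed as stack_top + offset; the sizes, offsets, and memory bound are illustrative assumptions:

```python
# A toy model of stack-dynamic allocation with relative addressing.
MEMORY_TOP = 1000

class RuntimeStack:
    def __init__(self):
        self.stack_top = MEMORY_TOP

    def push_frame(self, size):        # allocate a new activation record
        self.stack_top -= size
        return self.stack_top

    def pop_frame(self, size):         # deallocate on procedure return
        self.stack_top += size

    def local_addr(self, offset):      # address = stack_top + offset
        return self.stack_top + offset

rt = RuntimeStack()
base = rt.push_frame(16)               # a frame with two 8-byte locals
x_addr = rt.local_addr(0)              # local x at offset 0
y_addr = rt.local_addr(8)              # local y at offset 8
rt.pop_frame(16)
print(x_addr, y_addr)                  # 984 992
```

Because every call pushes a fresh frame, the same local gets fresh storage on every call, which is exactly what recursion requires and what static allocation cannot provide.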
6.10 HEAP ALLOCATION
Some languages do not have tree-structured allocations. In these cases, activations
have to be allocated on the heap. This allows strange situations, like callee activations that
live longer than their callers' activations. This is not common. The heap is used for allocating
space for objects created at run time, for example nodes of dynamic data structures such as
linked lists and trees.
Dynamic memory allocation and deallocation based on the requirements of the program:
malloc() and free() in C programs
new and delete in C++ programs
new and garbage collection in Java programs
Allocation and deallocation may be completely manual (C/C++), semi-automatic (Java), or
fully automatic (Lisp)
6.11 PARAMETERS PASSING
A language has first-class functions if functions can be declared within any scope, passed as arguments to other functions, and returned as results of functions. In a language with first-class functions and static scope, a function value is generally represented by a closure: a pair consisting of a pointer to the function code and a pointer to an activation record. Passing functions as arguments is very useful in structuring systems using upcalls.
An example:
main()
{ int x = 4;
int f (int y) {
return x*y;
}
int g (int →int h){
int x = 7;
return h(3) + x;
}
g(f); // h(3) = f(3) = 4*3 = 12, so g returns 12 + 7 = 19 under static scope
}
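Since Python also uses static scope, the example can be transcribed directly to check the value of g(f):

```python
# Transcription of the example above; under static scope f's x is
# main's x (4), so h(3) = 4*3 = 12 and g returns 12 + 7 = 19.
def main():
    x = 4
    def f(y):
        return x * y      # closure over main's x
    def g(h):
        x = 7             # g's own local x
        return h(3) + x
    return g(f)
```

Under dynamic scope the call h(3) would instead find g's x = 7, illustrating why the closure must carry a pointer to the defining activation record.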
Passing Functions as Parameters Implementation with Static Scope
UNIT 7
INTERMEDIATE CODE
7.1. INTERMEDIATE CODE GENERATION
In the analysis-synthesis model of a compiler, the front end analyzes a source
program and creates an intermediate representation, from which the back end generates target
code. This facilitates retargeting: enables attaching a back end for the new machine to an
existing front end.
Logical Structure of a Compiler Front End
A compiler front end is organized as in figure above, where parsing, static checking,
and intermediate-code generation are done sequentially; sometimes they can be combined
and folded into parsing. All schemes can be implemented by creating a syntax tree and then
walking the tree.
Static Checking
This includes type checking, which ensures that operators are applied to compatible operands. It also includes any syntactic checks that remain after parsing, such as:
Flow-of-control checks
o Ex: A break statement must appear within a loop construct
Uniqueness checks
o Ex: Labels in case statements must be distinct
Name-related checks
Intermediate Representations
We could translate the source program directly into the target language. However, there
are benefits to having an intermediate, machine-independent representation.
A clear distinction between the machine-independent and machine-dependent parts of
the compiler
Retargeting is facilitated: the implementation of language processors for new
machines will require replacing only the back-end.
We could apply machine independent code optimization techniques
Intermediate representations span the gap between the source and target languages.
High Level Representations
closer to the source language
easy to generate from an input program
code optimizations may not be straightforward
Low Level Representations
closer to the target machine
Suitable for register allocation and instruction selection
easier for optimizations, final code generation
There are several options for intermediate code. They can be either:
• Specific to the language being implemented, e.g.
P-code for Pascal
Bytecode for Java
• Language independent, such as three-address code
7.2 LANGUAGE INDEPENDENT 3-ADDRESS CODE
IR can be either an actual language or a group of internal data structures that are shared by
the phases of the compiler. C is sometimes used as an intermediate language, as it is flexible, compiles into
efficient machine code, and its compilers are widely available. In all cases, the intermediate
code is a linearization of the syntax tree produced during syntax and semantic analysis. It is
formed by breaking down the tree structure into sequential instructions, each of which is
equivalent to a single, or small number of machine instructions. Machine code can then be
generated (access might be required to symbol tables etc). TAC can range from high- to low-
level, depending on the choice of operators. In general, it is a statement containing at most 3
addresses or operands.
The general form is x := y op z, where “op” is an operator, x is the result, and y and z are
operands. x, y, z are variables, constants, or “temporaries”. A three-address instruction
consists of at most 3 addresses for each statement.
It is a linearized representation of a binary syntax tree. Explicit names correspond to interior
nodes of the graph. E.g. for a looping statement, the syntax tree represents the components of the
statement, whereas three-address code contains labels and jump instructions to represent the
flow of control as in machine language. A TAC instruction has at most one operator on the
RHS of an instruction; no built-up arithmetic expressions are permitted.
e.g. x + y * z can be translated as
t1 = y * z
t2 = x + t1
where t1 & t2 are compiler-generated temporary names.
Since it unravels multi-operator arithmetic expressions and nested control-flow statements,
it is useful for target code generation and optimization.
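The unraveling of an expression into TAC can be sketched as follows (the tree encoding and helper names here are invented for illustration, not taken from any particular compiler):

```python
# Minimal translation of an expression tree into three-address code.
# A tree is either a name (str) or a tuple (op, left, right).
temps = 0

def new_temp():
    global temps
    temps += 1
    return "t%d" % temps

def gen_tac(tree, code):
    """Emit TAC for tree into code; return the address of its value."""
    if isinstance(tree, str):
        return tree                      # a name is already an address
    op, left, right = tree
    l = gen_tac(left, code)
    r = gen_tac(right, code)
    t = new_temp()                       # one operator per instruction
    code.append("%s = %s %s %s" % (t, l, op, r))
    return t

code = []
gen_tac(("+", "x", ("*", "y", "z")), code)
# code == ["t1 = y * z", "t2 = x + t1"]
```

The post-order recursion mirrors a bottom-up walk of the syntax tree: operands are translated before the operator that combines them.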
Addresses and Instructions
• TAC consists of a sequence of instructions, each instruction may have up to three
addresses, prototypically t1 = t2 op t3
• Addresses may be one of:
o A name. Each name is a symbol table index. For convenience, we write the names
as the identifier.
o A constant.
o A compiler-generated temporary. Each time a temporary address is needed, the
compiler generates another name from the stream t1, t2, t3, etc.
• Temporary names allow for code optimization to easily move Instructions
• At target-code generation time, these names will be allocated to registers or to memory.
• TAC Instructions
o Symbolic labels will be used by instructions that alter the flow of control.
The instruction addresses of labels will be filled in later.
L: t1 = t2 op t3
o Assignment instructions: x = y op z
• Includes binary arithmetic and logical operations
o Unary assignments: x = op y
• Includes unary arithmetic op (-) and logical op (!) and type
conversion
o Copy instructions: x = y
o Unconditional jump: goto L
• L is a symbolic label of an instruction
o Conditional jumps:
if x goto L : if x is true, execute the instruction labelled L next
ifFalse x goto L : if x is false, execute the instruction labelled L next
o Conditional jumps:
if x relop y goto L
Procedure calls. For a procedure call p(x1, …, xn):
param x1
…
param xn
call p, n
Function calls: for y = p(x1, …, xn), the param sequence is followed by y = call p, n; the callee returns with return y
Indexed copy instructions: x = y[i] and x[i] = y
Left: sets x to the value in the location i memory units beyond y
Right: sets the contents of the location i memory units beyond x to y
Address and pointer instructions:
• x = &y sets the value of x to be the location (address) of y.
• x = *y, presumably y is a pointer or temporary whose value is a
location. The value of x is set to the contents of that location.
• *x = y sets the value of the object pointed to by x to the value of y.
Example: Given the statement do i = i+1; while (a[i] < v );, the TAC can be written as
below in two ways, using either symbolic labels or position numbers of instructions for
labels (assuming 4-byte array elements):
L: t1 = i + 1
i = t1
t2 = i * 4
t3 = a[t2]
if t3 < v goto L
or, with position numbers,
100: t1 = i + 1
101: i = t1
102: t2 = i * 4
103: t3 = a[t2]
104: if t3 < v goto 100
Types of three address code
There are different types of statements in the source program for which three address code has to
be generated. Along with operands and operators, three address code also uses labels to
provide flow of control for statements like if-then-else, for and while. The different types of
three address code statements are:
Assignment statement
a = b op c
In the above case b and c are operands, while op is binary or logical operator. The result of
applying op on b and c is stored in a.
Unary operation
a = op b
This is used for unary minus or logical negation.
Example: a = b * (- c) + d
Three address code for the above example will be
t1 = -c
t2 = t1 * b
t3 = t2 + d
a = t3
Copy Statement
a = b
The value of b is stored in variable a.
Unconditional jump
goto L
Creates label L and generates three-address code ‘goto L’.
Conditional jump
if exp goto L
Creates label L and generates code for expression exp. If exp returns the value true, control goes to
the statement labelled L; if exp returns false, control goes to the statement immediately following
the if statement.
Function call
For a function fun with n arguments a1,a2,a3….an ie.,
fun(a1, a2, a3,…an),
the three address code will be
param a1
param a2
…
param an
call fun, n
where param defines the arguments to the function.
Array indexing
In order to access the elements of an array, either single dimensional or
multi-dimensional, three address code requires the base address and an offset value. The base address
is the address of the first element in the array. Other elements of the array can be
accessed using the base address and offset value.
Example: x = y[i]
Memory location m = Base address of y + Displacement i
x = contents of memory location m
similarly x[i] = y
Memory location m = Base address of x + Displacement i
The value of y is stored in memory location m
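The address arithmetic above can be sketched as follows (the memory model, base addresses, and element width of 4 are all made-up values for illustration):

```python
# x = y[i] and x[i] = y modeled with explicit base + displacement.
WIDTH = 4                       # assumed element size in bytes
memory = {}
base = {"y": 1000, "x": 2000}   # illustrative base addresses

def indexed_load(arr, i):
    m = base[arr] + i * WIDTH   # location i elements beyond arr's base
    return memory[m]

def indexed_store(arr, i, val):
    m = base[arr] + i * WIDTH
    memory[m] = val

indexed_store("y", 2, 42)       # y[2] = 42
x = indexed_load("y", 2)        # x = y[2]
```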
Pointer assignment
x = &y x stores the address of memory location y
x = *y y is a pointer whose r-value is location
*x = y sets r-value of the object pointed by x to the r-value of y
Intermediate representation should have an operator set which is rich enough to implement most of the
operations of the source language. It should also help in mapping to the restricted instruction set of the
target machine.
Data Structure
Three address code is represented as a record structure with fields for the operator and operands. These
records can be stored as array or linked list. Most common implementations of three address
code are-
Quadruples, Triples and Indirect triples.
7.3 QUADRUPLES-
A quadruple consists of four fields in the record structure: one field to store the operator op, two
fields to store the operands or arguments arg1 and arg2, and one field to store the result res:
res = arg1 op arg2
Example: a = b + c
b is represented as arg1, c is represented as arg2, + as op and a as res.
Unary operators like unary minus do not use arg2. Operators like param use neither arg2 nor res. For
conditional and unconditional jumps, res is the label. arg1, arg2 and res are pointers to the
symbol table or literal table for the names.
Example: a = -b * d + c + (-b) * d
Three address code for the above statement is as follows
t1 = - b
t2 = t1 * d
t3 = t2 + c
t4 = - b
t5 = t4 * d
t6 = t3 + t5
a = t6
Quadruples for the above example is as follows
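The quadruple table is not reproduced here; one plausible rendering of it as records (op, arg1, arg2, res) is the following sketch:

```python
# Quadruples for a = -b * d + c + (-b) * d, following the TAC above.
# None marks an unused field (e.g. arg2 of a unary operator).
quads = [
    ("uminus", "b",  None, "t1"),
    ("*",      "t1", "d",  "t2"),
    ("+",      "t2", "c",  "t3"),
    ("uminus", "b",  None, "t4"),
    ("*",      "t4", "d",  "t5"),
    ("+",      "t3", "t5", "t6"),
    ("=",      "t6", None, "a"),
]
```

Note that every intermediate result has an explicit name (t1..t6), which is what lets quadruples be reordered by an optimizer.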
7.4 TRIPLES
Triples use only three fields in the record structure: one field for the operator and two fields for
operands named arg1 and arg2. The value of a temporary variable is accessed by the
position of the statement that computes it, and not by location as in quadruples.
Example: a = -b * d + c + (-b) * d
Triples for the above example is as follows
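The triple table is likewise missing; a plausible rendering, where an integer operand refers to the position of the triple that computes the value, is:

```python
# Triples for a = -b * d + c + (-b) * d. Entry k holds (op, arg1, arg2);
# an int argument is a reference to an earlier triple's result.
triples = [
    ("uminus", "b", None),  # (0)  -b
    ("*",      0,   "d"),   # (1)  (0) * d
    ("+",      1,   "c"),   # (2)  (1) + c
    ("uminus", "b", None),  # (3)  -b
    ("*",      3,   "d"),   # (4)  (3) * d
    ("+",      2,   4),     # (5)  (2) + (4)
    ("=",      "a", 5),     # (6)  a = (5)
]
```

Because results are identified by position, moving a triple means renumbering every reference to it; this is the motivation for indirect triples below.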
Arg1 and arg2 may be pointers to symbol table for program variables or literal table for
constant or pointers into triple structure for intermediate results.
Example: Triples for statement x[i] = y which generates two records is as follows
Triples for statement x = y[i] which generates two records is as follows
Triples are an alternative way of representing the syntax tree or directed acyclic graph for the
names defined in the program.
Indirect Triples
Indirect triples are used to achieve indirection in the listing of pointers. That is, the representation uses a list of
pointers to triples rather than listing the triples themselves.
Example: a = -b * d + c + (-b) * d
Each three-address statement thus contains an operator and operands. Representations include quadruples, triples and indirect
triples.
7.5 SYNTAX TREES
Syntax trees are high level IR. They depict the natural hierarchical structure of the source
program. Nodes represent constructs in source program and the children of a node represent
meaningful components of the construct. Syntax trees are suited for static type checking.
Variants of Syntax Trees: DAG
A directed acyclic graph (DAG) for an expression identifies the common sub expressions
(sub expressions that occur more than once) of the expression. DAGs can be constructed
by using the same techniques that construct syntax trees.
A DAG has leaves corresponding to atomic operands and interior nodes corresponding to
operators. A node N in a DAG has more than one parent if N represents a common sub
expression, so a DAG represents expressions concisely. It gives the compiler clues about
generating efficient code to evaluate expressions.
Example 1: Given the grammar below, for the input string id + id * id , the parse tree,
syntax tree and the DAG are as shown.
Example : DAG for the expression a + a * (b - c) + ( b - c ) * d is shown below.
Using the SDD to draw syntax tree or DAG for a given expression:-
• Draw the parse tree
• Perform a post order traversal of the parse tree
• Perform the semantic actions at every node during the traversal
These functions construct a DAG if, before creating a new node, they check whether an
identical node already exists. If yes, the existing node is returned.
SDD to produce Syntax trees or DAG is shown below.
For the expression a + a * (b - c) + (b - c) * d, the steps for constructing the DAG are as
below.
7.6 BASIC BLOCKS AND FLOW GRAPHS
A graph representation of three-address statements, called a flow graph, is useful for
understanding code-generation algorithms, even if the graph is not explicitly constructed by a
code-generation algorithm. Nodes in the flow graph represent computations, and the edges
represent the flow of control. Flow graph of a program can be used as a vehicle to collect
information about the intermediate program. Some register-assignment algorithms use flow
graphs to find the inner loops where a program is expected to spend most of its time.
BASIC BLOCKS
A basic block is a sequence of consecutive statements in which flow of control
enters at the beginning and leaves at the end without halt or possibility of branching except at
the end. The following sequence of three-address statements forms a basic block:
t1 := a*a
t2 := a*b
t3 := 2*t2
t4 := t1+t3
t5 := b*b
t6 := t4+t5
A three-address statement x := y+z is said to define x and to use y or z. A name in a basic
block is said to live at a given point if its value is used after that point in the program,
perhaps in another basic block.
The following algorithm can be used to partition a sequence of three-address statements into
basic blocks.
Algorithm 1: Partition into basic blocks.
Input: A sequence of three-address statements.
Output: A list of basic blocks with each three-address statement in exactly one block.
Method:
1. We first determine the set of leaders, the first statements of basic blocks.
The rules we use are the following:
I) The first statement is a leader.
II) Any statement that is the target of a conditional or unconditional goto is a leader.
III) Any statement that immediately follows a goto or conditional goto statement is a
leader.
2. For each leader, its basic block consists of the leader and all statements up to but not
including the next leader or the end of the program.
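Algorithm 1 can be sketched as follows (instruction numbering is 1-based; targets_of is a hypothetical helper giving each instruction's jump targets, empty for non-jumps):

```python
# Partition a sequence of n three-address instructions into basic blocks.
def find_leaders(n, targets_of):
    leaders = {1}                    # rule I: the first statement
    for i in range(1, n + 1):
        ts = targets_of(i)
        for t in ts:
            leaders.add(t)           # rule II: every jump target
        if ts and i + 1 <= n:
            leaders.add(i + 1)       # rule III: statement after a goto
    return sorted(leaders)

def basic_blocks(n, targets_of):
    ls = find_leaders(n, targets_of)
    ends = ls[1:] + [n + 1]          # each block runs to the next leader
    return [(a, b - 1) for a, b in zip(ls, ends)]

# In the 12-instruction example below, only instruction (12) jumps, to (3):
blocks = basic_blocks(12, lambda i: [3] if i == 12 else [])
# blocks == [(1, 2), (3, 12)]
```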
Example 3: Consider the fragment of source code shown in fig. 7; it computes the dot
product of two vectors a and b of length 20. A list of three-address statements performing
this computation on our target machine is shown in fig. 8.
begin
prod := 0;
i := 1;
do begin
prod := prod + a[i] * b[i];
i := i+1;
end
while i<= 20
end
Let us apply Algorithm 1 to the three-address code in fig 8 to determine its basic
blocks. Statement (1) is a leader by rule (I) and statement (3) is a leader by rule (II), since the
last statement can jump to it. By rule (III), the statement following (12) is a leader. Therefore,
statements (1) and (2) form a basic block. The remainder of the program beginning with
statement (3) forms a second basic block.
(1) prod := 0
(2) i := 1
(3) t1 := 4*i
(4) t2 := a [ t1 ]
(5) t3 := 4*i
(6) t4 :=b [ t3 ]
(7) t5 := t2*t4
(8) t6 := prod +t5
(9) prod := t6
(10) t7 := i+1
(11) i := t7
(12) if i<=20 goto (3)
7.7 TRANSFORMATIONS ON BASIC BLOCKS
A basic block computes a set of expressions. These expressions are the values of the names
live on exit from block. Two basic blocks are said to be equivalent if they compute the same
set of expressions. A number of transformations can be applied to a basic block without
changing the set of expressions computed by the block. Many of these transformations are
useful for improving the quality of code that will be ultimately generated from a basic block.
There are two important classes of local transformations that can be applied to basic blocks;
these are the structure-preserving transformations and the algebraic transformations.
7.8 STRUCTURE-PRESERVING TRANSFORMATIONS
The primary structure-preserving transformations on basic blocks are:
1. Common sub-expression elimination
2. Dead-code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements
We assume basic blocks have no arrays, pointers, or procedure calls.
1. Common sub-expression elimination
Consider the basic block
a:= b+c
b:= a-d
c:= b+c
d:= a-d
The second and fourth statements compute the same expression, namely b+c-d, and hence
this basic block may be transformed into the equivalent block
a:= b+c
b:= a-d
c:= b+c
d:= b
Although the 1st and 3rd statements in both cases appear to have the same expression
on the right, the second statement redefines b. Therefore, the value of b in the 3rd
statement is different from the value of b in the 1st, and the 1st and 3rd statements do
not compute the same expression.
2. Dead-code elimination
Suppose x is dead, that is, never subsequently used, at the point where the statement
x:= y+z appears in a basic block. Then this statement may be safely removed without
changing the value of the basic block.
3. Renaming temporary variables
Suppose we have a statement t:= b+c, where t is a temporary. If we change this statement to
u:= b+c, where u is a new temporary variable, and change all uses of this instance of t to u,
then the value of the basic block is not changed.
4. Interchange of statements
Suppose we have a block with the two adjacent statements
t1:= b+c
t2:= x+y
Then we can interchange the two statements without affecting the value of the block if and
only if neither x nor y is t1 and neither b nor c is t2. A normal-form basic block permits all
statement interchanges that are possible.
7.9 DAG REPRESENTATION OF BASIC BLOCKS
The goal is to obtain a visual picture of how information flows through the block. The leaves
will show the values entering the block; as we proceed up the DAG we encounter uses of
these values, defs (and redefs) of values, and uses of the new values.
Formally, this is defined as follows.
1. Create a leaf for the initial value of each variable appearing in the block. (We do not
know what that value is, nor even whether the variable has ever been given a value.)
2. Create a node N for each statement s in the block.
i. Label N with the operator of s. This label is drawn inside the node.
ii. Attach to N those variables for which N is the last def in the block. These additional labels
are drawn along side of N.
iii. Draw edges from N to each statement that is the last def of an operand used by N.
3. Designate as output nodes those N whose values are live on exit, an officially-mysterious
term meaning values possibly used in another block. (Determining the live on exit values
requires global, i.e., inter-block, flow analysis.) As we shall see in the next few sections
various basic-block optimizations are facilitated by using the DAG.
Finding Local Common Subexpressions
As we create nodes for each statement, proceeding in the static order of the statements, we
might notice that a new node is just like one already in the DAG, in which case we don't need
a new node and can use the old node to compute the new value in addition to the one it
already was computing. Specifically, we do not construct a new node if an existing node has
the same children in the same order and is labeled with the same operation.
Consider computing the DAG for the following block of code.
a = b + c
c = a + x
d = b + c
b = a + x
The DAG construction is explained as follows.
1. First we construct leaves with the initial values.
2. Next we process a = b + c. This produces a node labeled + with a attached and having b0
and c0 as children.
3. Next we process c = a + x.
4. Next we process d = b + c. Although we have already computed b + c in the first
statement, the c's are not the same, so we produce a new node.
5. Then we process b = a + x. Since we have already computed a + x in statement 2, we do
not produce a new node, but instead attach b to the old node.
6. Finally, we tidy up and erase the unused initial values.
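The reuse test ("same children, same order, same operation") is exactly local value numbering; a compact sketch over the block above (helper names invented for this sketch):

```python
# Local value numbering: reuse a node when an identical
# (op, left-id, right-id) triple already exists in the table.
def value_number(block):
    """block: list of (dst, op, src1, src2); returns var -> node id."""
    node_of, table = {}, {}
    counter = [0]
    def node_for(v):
        if v not in node_of:          # create a leaf for an initial value
            node_of[v] = counter[0]; counter[0] += 1
        return node_of[v]
    for dst, op, s1, s2 in block:
        key = (op, node_for(s1), node_for(s2))
        if key not in table:          # no identical node: make a new one
            table[key] = counter[0]; counter[0] += 1
        node_of[dst] = table[key]     # attach dst to the (old or new) node
    return node_of

ids = value_number([("a", "+", "b", "c"), ("c", "+", "a", "x"),
                    ("d", "+", "b", "c"), ("b", "+", "c0_was", "x")
                    if False else ("b", "+", "a", "x")])
# b ends up attached to the node built for a + x in statement 2
```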
You might think that with only three computation nodes in the DAG, the block could be
reduced to three statements (dropping the computation of b). However, this is wrong. Only if
b is dead on exit can we omit the computation of b. We can, however, replace the last
statement with the simpler b = c. Sometimes a combination of techniques finds
improvements that no single technique would find. For example if a-b is computed, then both
a and b are incremented by one, and then a-b is computed again, it will not be recognized as a
common subexpression even though the value has not changed. However, when combined
with various algebraic transformations, the common value can be recognized.
7.10 DEAD CODE ELIMINATION
Assume we are told (by global flow analysis) that certain values are dead on exit. We
examine each root (node with no ancestor) and delete any that have no live variables
attached. This process is repeated since new roots may have appeared.
For example, if we are told, for the DAG just constructed, that only a and b are live, then the
root d can be removed since d is dead. Then the rightmost node becomes a root, which also
can be removed (since c is dead).
The Use of Algebraic Identities
Some of these are quite clear. We can of course replace x+0 or 0+x by simply x. Similar
considerations apply to 1*x, x*1, x-0, and x/1.
Strength reduction
Another class of simplifications is strength reduction, where we replace one operation by a
cheaper one. A simple example is replacing 2*x by x+x on architectures where addition is
cheaper than multiplication. A more sophisticated strength reduction is applied by compilers
that recognize induction variables (loop indices). Inside a for i from 1 to N loop, the
expression 4*i can be strength reduced to j=j+4 and 2^i can be strength reduced to j=2*j
(with suitable initializations of j just before the loop). Other uses of algebraic identities are
possible; many require a careful reading of the language
reference manual to ensure their legality. For example, even though it might be advantageous
to convert ((a + b) * f(x)) * a to ((a + b) * a) * f(x)
it is illegal in Fortran since the programmer's use of parentheses to specify the order of
operations can not be violated.
Does
a = b + c
x = y + c + b + r
contain a common sub expression of b+c that need be evaluated only once?
The answer depends on whether the language permits the use of the associative and
commutative law for addition. (Note that the associative law is invalid for floating point
numbers.)
UNIT-8
OPTIMIZATION
8.1 PRINCIPLE SOURCES OF OPTIMIZATION
A transformation of a program is called local if it can be performed by looking only at the
statements in a basic block; otherwise, it is called global. Many transformations can be
performed at both the local and global levels. Local transformations are usually performed
first.
Function-Preserving Transformations
There are a number of ways in which a compiler can
improve a program without changing the function it computes. Common sub expression
elimination, copy propagation, dead-code elimination, and constant folding are common
examples of such function-preserving transformations. The other transformations come up
primarily when global optimizations are performed. Frequently, a program will include
several calculations of the same value, such as an offset in an array. Some of these duplicate
calculations cannot be avoided by the programmer because they lie below the level of detail
accessible within the source language. For example, block B5 recalculates 4*i and 4*j.
Common Sub expressions
An occurrence of an expression E is called a common sub
expression if E was previously computed, and the values of the variables in E have not changed
since the previous computation. We can avoid recomputing the expression if we can use the
previously computed value. For example, the assignments to t7 and t10 have the common sub
expressions 4*i and 4*j, respectively, on the right side in Fig. They have been eliminated in
Fig by using t6 instead of t7 and t8 instead of t10. This change is what would result if we
reconstructed the intermediate code from the DAG for the basic block.
Example: the above Fig shows the result of eliminating both global and local common sub
expressions from blocks B5 and B6 in the flow graph of Fig. We first discuss the
transformation of B5 and then mention some subtleties involving arrays.
After local common sub expressions are eliminated B5 still evaluates 4*i and 4*j, as
shown in the earlier fig. Both are common sub expressions; in particular, the three statements
t8:= 4*j; t9:= a[t8]; a[t8]:= x in B5 can be replaced by t9:= a[t4]; a[t4]:= x using t4 computed
in block B3. In Fig. observe that as control passes from the evaluation of 4*j in B3 to B5,
there is no change in j, so t4 can be used if 4*j is needed.
Another common sub expression comes to light in B5 after t4 replaces t8. The new
expression a[t4] corresponds to the value of a[j] at the source level. Not only does j retain its
value as control leaves B3 and then enters B5, but a[j], a value computed into a temporary t5,
does too, because there are no assignments to elements of the array a in the interim. The
statements t9:= a[t4]; a[t6]:= t9 in B5 can therefore be replaced by a[t6]:= t5.
The expression a[t1] in blocks B1 and B6 is not considered a common sub expression,
although t1 can be used in both places. After control leaves B1 and before it reaches B6, it
can go through B5, where there are assignments to a. Hence, a[t1] may not have the same
value on reaching B6 as it did on leaving B1, and it is not safe to treat a[t1] as a common sub
expression.
Copy Propagation
Block B5 in Fig. can be further improved by eliminating x using two new transformations.
One concerns assignments of the form f:=g called copy statements, or copies for short. Had
we gone into more detail in Example 10.2, copies would have arisen much sooner, because
the algorithm for eliminating common sub expressions introduces them, as do several other
algorithms. For example, when the common sub expression in c:=d+e is eliminated in Fig.,
the algorithm uses a new variable t to hold the value of d+e. Since control may reach c:=d+e
either after the assignment to a or after the assignment to b, it would be incorrect to replace
c:=d+e by either c:=a or by c:=b. The idea behind the copy-propagation transformation is to
use g for f, wherever possible after the copy statement f:=g. For example, the assignment
x:=t3 in block B5 of Fig. is a copy. Copy propagation applied to B5 yields:
x:=t3
a[t2]:=t5
a[t4]:=t3
goto B2
(Copies introduced during common subexpression elimination.) This may not appear
to be an improvement, but as we shall see, it gives us the opportunity to eliminate the
assignment to x.
8.2 DEAD-CODE ELIMINATIONS
A variable is live at a point in a program if its value can be used subsequently; otherwise, it is
dead at that point. A related idea is dead or useless code, statements that compute values that
never get used. While the programmer is unlikely to introduce any dead code intentionally, it
may appear as the result of previous transformations. For example, we discussed the use of
debug that is set to true or false at various points in the program, and used in statements like
If (debug) print. By a data-flow analysis, it may be possible to deduce that each time the
program reaches this statement, the value of debug is false. Usually, it is because there is one
particular statement
debug := false
that we can deduce to be the last assignment to debug prior to the test, no matter what
sequence of branches the program actually takes. If copy propagation replaces debug by
false, then the print statement is dead because it cannot be reached. We can eliminate both the
test and the printing from the object code. More generally, deducing at compile time that the
value of an expression is a constant and using the constant instead is known as constant
folding. One advantage of copy propagation is that it often turns the copy statement into dead
code. For example, copy propagation followed by dead-code elimination removes the
assignment to x and transforms the block above into
a [t2 ] := t5
a [t4] := t3
goto B2
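The two transformations can be sketched together on a straight-line block (the instruction encoding is invented for this sketch: ("copy", dst, src) and ("store", location, value), with stores always considered live):

```python
# Copy propagation followed by dead-code elimination.
def optimize(block, live_on_exit):
    env, prop = {}, []
    for kind, a, b in block:          # pass 1: copy propagation
        b = env.get(b, b)             # use the copied-from name for b
        if kind == "copy":
            env[a] = b
        prop.append((kind, a, b))
    live, kept = set(live_on_exit), []
    for kind, a, b in reversed(prop): # pass 2: dead-code elimination
        if kind == "copy":
            if a not in live:
                continue              # target never used: dead copy
            live.discard(a)
        live.add(b)
        kept.append((kind, a, b))
    return kept[::-1]

block = [("copy", "x", "t3"),
         ("store", "a[t2]", "t5"),
         ("store", "a[t4]", "x")]
optimized = optimize(block, live_on_exit=[])
# optimized == [("store", "a[t2]", "t5"), ("store", "a[t4]", "t3")]
```

Pass 1 rewrites the use of x into t3; pass 2 then finds the copy x := t3 dead and drops it, reproducing the transformed block above.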
8.3 PEEPHOLE OPTIMIZATION
A statement-by-statement code-generation strategy often produces target code that contains
redundant instructions and suboptimal constructs. The quality of such target code can be
improved by applying “optimizing” transformations to the target program.
A simple but effective technique for improving the target code is peephole optimization, a
method for trying to improve the performance of the target program by examining a short
sequence of target instructions (called the peephole) and replacing these instructions by a
shorter or faster sequence, whenever possible.
The peephole is a small, moving window on the target program. The code in the peephole need not be contiguous, although some implementations do require this. We shall
give the following examples of program transformations that are characteristic of peephole
optimizations:
• Redundant-instructions elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
REDUNDANT LOADS AND STORES
If we see the instruction sequence
(1) MOV R0, a
(2) MOV a, R0
we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the value of a is already in register R0. If (2) had a label, we could not be sure that (1) was always executed immediately before (2), and so we could not remove (2).
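A minimal sketch of this peephole rule, assuming instructions are plain strings in the `MOV src,dst` style used above and that jump-target indices are supplied separately (both assumptions of this sketch, not of any particular assembler):

```python
def remove_redundant_loads(code, labels=()):
    """Delete a 'MOV x,y' that merely undoes the preceding 'MOV y,x'.

    Instructions are strings like 'MOV R0,a' (move R0 into a); `labels`
    holds the indices of instructions that are jump targets, which must
    not be removed this way.
    """
    out = []
    for i, ins in enumerate(code):
        if out and i not in labels and ins.startswith("MOV "):
            s, d = ins[4:].split(",")
            if out[-1] == f"MOV {d},{s}":
                continue          # the value is already where (2) puts it
        out.append(ins)
    return out

peeped = remove_redundant_loads(["MOV R0,a", "MOV a,R0", "ADD R0,b"])
```

The label check mirrors the caveat in the text: a labeled instruction may be reached by a jump, so the preceding store cannot be assumed to have just executed.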
UNREACHABLE CODE
Another opportunity for peephole optimizations is the removal of unreachable
instructions. An unlabeled instruction immediately following an unconditional jump may be
removed. This operation can be repeated to eliminate a sequence of instructions. For
example, for debugging purposes, a large program may have within it certain segments that
are executed only if a variable debug is 1. In C, the source code might look like:
#define debug 0
….
if ( debug ) {
print debugging information
}
In the intermediate representation the if-statement may be translated as:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2: …………………………(a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what the value of debug, (a) can be replaced by:
if debug ≠ 1 goto L2
print debugging information
L2: ……………………………(b)
If debug is replaced by its constant value 0, (b) becomes:
if 0 ≠ 1 goto L2
print debugging information
L2: ……………………………(c)
As the argument of the first statement of (c) evaluates to a constant true, it can be replaced by goto L2. Then all the statements that print debugging aids are manifestly unreachable and can be eliminated one at a time.
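The removal of unlabeled instructions after an unconditional jump can be sketched as a single pass; the `L1:`-prefix label convention here is an assumption of the sketch:

```python
def drop_unreachable(code):
    # Remove unlabeled instructions that follow an unconditional jump.
    # Labels are written as 'L1:'-style prefixes on an instruction.
    out, skipping = [], False
    for ins in code:
        if ":" in ins:            # a label makes the code reachable again
            skipping = False
        if not skipping:
            out.append(ins)
        if ins.strip().startswith("goto"):
            skipping = True       # nothing after an unconditional jump runs
    return out

cleaned = drop_unreachable(["goto L2", "print debug info", "L2: halt"])
```

Repeating the pass, as the text notes, eliminates a whole sequence of unreachable statements one at a time.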
8.4 FLOW-OF-CONTROL OPTIMIZATIONS
The unnecessary jumps can be eliminated in either the intermediate code or the
target code by the following types of peephole optimizations. We can replace the jump sequence
goto L1
….
L1 : goto L2
by the sequence
goto L2
….
L1 : goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
….
L1 : goto L2
can be replaced by
if a < b goto L2
….
L1 : goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the sequence
goto L1
……..
L1:if a<b goto L2
L3: …………………………………..(1)
may be replaced by
if a<b goto L2
goto L3
…….
L3: ………………………………….(2)
While the number of instructions in (1) and (2) is the same, we sometimes skip the unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution time.
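A one-step version of this jump-chain rewriting can be sketched as follows (repeating the pass would collapse longer chains; the string encoding is hypothetical):

```python
def collapse_jump_chains(code):
    """Rewrite 'goto L1' to 'goto L2' when L1 labels only 'goto L2'."""
    # Map each label to the jump target it immediately forwards to.
    forwards = {}
    for ins in code:
        if ":" in ins:
            label, body = ins.split(":", 1)
            if body.strip().startswith("goto "):
                forwards[label.strip()] = body.strip()[5:]
    out = []
    for ins in code:
        target = ins.split("goto ")[-1] if "goto " in ins else None
        if target in forwards:
            ins = ins.replace("goto " + target, "goto " + forwards[target])
        out.append(ins)
    return out

chained = collapse_jump_chains(["goto L1", "x := 1", "L1: goto L2"])
```

After the rewrite, the `L1: goto L2` statement survives; a separate pass that counts jumps to L1 could then remove it, as described above.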
8.5 REGISTER ALLOCATION
Instructions involving register operands are usually shorter and faster than those involving operands in memory. Therefore, efficient utilization of registers is particularly important in generating good code. The use of registers is often subdivided into two subproblems:
1. During register allocation, we select the set of variables that will reside in registers at a
point in the program.
2. During a subsequent register assignment phase, we pick the specific register that a variable will reside in. Finding an optimal assignment of registers to variables is difficult, even with single-register machines. Mathematically, the problem is NP-complete. The problem is further complicated because the hardware and/or the operating system of the target machine may require that certain register-usage conventions be observed.
Certain machines require register pairs (an even and the next odd-numbered register) for some operands and results. For example, in the IBM System/370 machines, integer multiplication and integer division involve register pairs. The multiplication instruction is of the form
M x, y
where x, the multiplicand, is the even register of an even/odd register pair. The multiplicand value is taken from the odd register of the pair. The multiplier y is a single register. The product occupies the entire even/odd register pair.
The division instruction is of the form D x, y, where the 64-bit dividend occupies an even/odd register pair whose even register is x; y represents the divisor. After division, the even register holds the remainder and the odd register the quotient.
Now consider the two three-address code sequences (a) and (b) below, in which the only difference is the operator in the second statement. The shortest assembly sequences for (a) and (b) are also shown; Ri stands for register i, and L, ST and A stand for load, store and add, respectively. The optimal choice for the register into which a is to be loaded depends on what will ultimately happen to t.
t := a + b          t := a + b
t := t * c          t := t + c
t := t / d          t := t / d
(a)                 (b)
Two three-address code sequences

L    R1, a          L    R0, a
A    R1, b          A    R0, b
M    R0, c          A    R0, c
D    R0, d          SRDA R0, 32
ST   R1, t          D    R0, d
                    ST   R1, t
(a)                 (b)
8.6 CHOICE OF EVALUATION ORDER
The order in which computations are performed can affect the efficiency of the target code.
Some computation orders require fewer registers to hold intermediate results than others.
Picking a best order is another difficult, NP-complete problem. Initially, we shall avoid the
problem by generating code for the three -address statements in the order in which they have
been produced by the intermediate code generator.
8.7 APPROACHES TO CODE GENERATION
The most important criterion for a code generator is that it produce correct code. Correctness takes on special significance because of the number of special cases that a code generator must face. Given the premium on correctness, designing a code generator so it can be easily implemented, tested, and maintained is an important design goal.
REFERENCE COUNTING GARBAGE COLLECTION
The difficulty in garbage collection is not the actual process of collecting the garbage -- it is the problem of finding the garbage in the first place. An object is considered to be garbage when no references to that object exist. But how can we tell when no references to an object exist? A simple expedient is to keep track, in each object, of the total number of references to that object. That is, we add a special field to each object called a reference count. The idea is that the reference count field is not accessible to the Java program. Instead, the reference count field is updated by the Java virtual machine itself.
Consider the statement
Object p = new Integer (57);
which creates a new instance of the Integer class. Only a single variable, p, refers to the
object. Thus, its reference count should be one.
Figure: Objects with reference counters.
Now consider the following sequence of statements:
Object p = new Integer (57);
Object q = p;
This sequence creates a single Integer instance. Both p and q refer to the same object.
Therefore, its reference count should be two.
In general, every time one reference variable is assigned to another, it may be necessary to
update several reference counts. Suppose p and q are both reference variables. The
assignment
p = q;
would be implemented by the Java virtual machine as follows:
if (p != q)
{
    if (p != null)
        --p.refCount;      // p abandons its old referent
    p = q;
    if (p != null)
        ++p.refCount;      // p now shares q's referent
}
For example suppose p and q are initialized as follows:
Object p = new Integer (57);
Object q = new Integer (99);
As shown in Figure (a), two Integer objects are created, each with a reference count of
one. Now, suppose we assign q to p using the code sequence given above. Figure (b)
shows that after the assignment, both p and q refer to the same object--its reference count is
two, and the reference count on Integer(57) has gone to zero, which indicates that it is garbage.
Figure: Reference counts before and after the assignment p = q.
The costs of using reference counts are twofold: First, every object requires the special
reference count field. Typically, this means an extra word of storage must be allocated in
each object. Second, every time one reference is assigned to another, the reference counts
must be adjusted as above. This significantly increases the time taken by assignment statements.
The advantage of using reference counts is that garbage is easily identified. When it becomes
necessary to reclaim the storage from unused objects, the garbage collector needs only to
examine the reference count fields of all the objects that have been created by the program. If
the reference count is zero, the object is garbage.
It is not necessary to wait until there is insufficient memory before initiating the garbage
collection process. We can reclaim memory used by an object immediately when its
reference goes to zero. Consider what happens if we implement the Java assignment p = q in
the Java virtual machine as follows:
if (p != q)
{
if (p != null)
if (--p.refCount == 0)
heap.release (p);
p = q;
if (p != null)
++p.refCount;
}
Notice that the release method is invoked immediately when the reference count of an object
goes to zero, i.e., when it becomes garbage. In this way, garbage may be collected
incrementally as it is created.
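The behaviour described above can be sketched in Python; the `Obj` class and `assign` helper are illustrative stand-ins for what the virtual machine would do, since real Java exposes no reference-count field:

```python
class Obj:
    """A heap object with an explicit reference count (a sketch of what
    the VM would maintain; real Java exposes no such field)."""
    def __init__(self, value, heap):
        self.value, self.ref_count, self.heap = value, 1, heap
        heap.append(self)

def assign(p, q):
    # Mimics the VM's handling of 'p = q', releasing p's old object
    # immediately if its count drops to zero (incremental collection).
    if p is not q:
        if p is not None:
            p.ref_count -= 1
            if p.ref_count == 0:
                p.heap.remove(p)       # release right away
        if q is not None:
            q.ref_count += 1
    return q

heap = []
p = Obj(57, heap)
q = Obj(99, heap)
p = assign(p, q)
# Integer(57) is reclaimed at once; the surviving object has count 2.
```

Running this mirrors the figures: after `p = q`, only the Integer(99) stand-in remains on the heap, with a reference count of two.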
TEXT BOOKS:
1. Compilers: Principles, Techniques and Tools - Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, 2nd ed., Pearson, 2007.
2. Principles of Compiler Design - V. Raghavan, 2nd ed., TMH, 2011.
3. Principles of Compiler Design - Nandini Prasad, 2nd ed., Elsevier.
REFERENCE BOOKS:
1. http://www.nptel.iitm.ac.in/downloads/106108052/
2. Compiler Construction: Principles and Practice - Kenneth C. Louden, CENGAGE.
3. Implementations of Compiler, A New Approach to Compilers Including the Algebraic Methods - Yunlinsu, SPRINGER.
CS6660 COMPILER DESIGN UNIT-1
UNIT-1
1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM
1.2 Preprocessor
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
1.3 COMPILER
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler is reporting errors to the programmer.

source pgm → COMPILER → target pgm
(error msgs)
Executing a program written in an HLL programming language basically has two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed, taking its input and producing output.

source pgm → COMPILER → obj pgm
input → obj pgm → output
1.4 ASSEMBLER: Programmers found it difficult to write or read programs in machine language. They began to use a mnemonic (symbol) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).
1.5 INTERPRETER: An interpreter is a program that appears to execute a source
program as if it were machine language.
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. JAVA also uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
Modification of the user program can easily be made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes the program machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
2 Loader and Link-editor:
Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate his program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
A loader is a program that places programs into memory and prepares them for execution. It would be more efficient if subroutines could be translated into object form which the loader could relocate directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
1.6 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role, error detection. Any violation of the HLL specification would be detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent ML program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
INTERPRETOR
COMPILER
PREPROSSESSOR
1.9 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes a source program in one representation and produces output in another representation. There are two parts of compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called phases.
PHASES OF A COMPILER
Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this
phase expressions, statements, declarations etc… are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on formal grammar of the
programming language.
Intermediate Code Generations:-
An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase, designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of machine language program are carried out during this phase. The
output of the code generator is the machine language program of the specified computer.
Table Management (or) Book-keeping:-
This is the portion to keep the names used by the program and records
essential information about each. The data structure used to record this information called a
Symbol Table.
Error Handlers:-
The error handler is invoked when a flaw (error) in the source program is detected.
The output of LA is a stream of tokens, which is passed to the next phase, the
syntax analyzer or parser. The SA groups the tokens together into syntactic structure called
as expression. Expression may further be combined to form statements. The syntactic
structure can be regarded as a tree whose leaves are the token called as parse trees.
The parser has two functions. It checks if the tokens from lexical analyzer,
occur in pattern that are permitted by the specification for the source language. It also
imposes on tokens a tree-like structure that is used by the sub-sequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression.
Syntax analysis is to make explicit the hierarchical structure of the incoming
token stream by identifying which parts of the token stream should be grouped.
For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
Intermediate Code Generation:-
The intermediate code generation uses the structure produced by the syntax
analyzer to create a stream of simple instructions. Many styles of intermediate code are
possible. One common style uses instruction with one operator and a small number of
operands.
The output of the syntax analyzer is some representation of a parse tree. the
intermediate code generation phase transforms this parse tree into an intermediate language
representation of the source program.
Code Optimization
This is an optional phase, designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For example,
if A > B goto L2
goto L3
L2:
can be replaced by the single statement
if A ≤ B goto L3
L2:
Another important local optimization is the elimination of common sub-expressions:
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
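This elimination of common sub-expressions can be sketched as local value numbering over three-address tuples; the `(dest, a, op, b)` encoding below is hypothetical:

```python
def cse(stmts):
    # Local value numbering: reuse the temporary that already holds (a op b).
    table, alias, out = {}, {}, []
    for dest, a, op, b in stmts:
        a, b = alias.get(a, a), alias.get(b, b)
        key = (a, op, b)
        if key in table:
            alias[dest] = table[key]   # dest is a copy of an earlier temp
        else:
            table[key] = dest
            out.append((dest, a, op, b))
    return out

# Naive translation of A := B+C+D; E := B+C+F computes B+C twice.
optimized = cse([("t1", "B", "+", "C"), ("A", "t1", "+", "D"),
                 ("t2", "B", "+", "C"), ("E", "t2", "+", "F")])
```

The second computation of B + C is dropped and its uses are redirected to the first temporary, matching the T1 example above (the sketch assumes no operand is reassigned within the block).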
2. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
Code Generator:-
The code generator produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done. Many computers have only a few high-speed registers in which computations can be performed quickly. A good code generator would attempt to utilize registers as efficiently as possible.
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear
in the source program. The information about data objects is collected by the early phases of
the compiler - the lexical and syntax analyzers. The data structure used to record this information is called a Symbol Table.
Error Handing :-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the programmer to
determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
Example:
position := initial + rate * 60

Lexical Analyzer:
id1 := id2 + id3 * 60

Syntax Analyzer: builds the parse tree

        :=
       /  \
     id1    +
           /  \
         id2    *
               /  \
             id3    60

Semantic Analyzer: inserts a conversion of the integer 60 to real

        :=
       /  \
     id1    +
           /  \
         id2    *
               /  \
             id3   inttoreal
                       |
                       60

Intermediate Code Generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1

Code Generator:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
1.10 TOKEN
The LA reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens. A token has two parts:
1. Type of the token.
2. Value of the token.
Type: variable, operator, keyword, constant.
Value: name of the variable, current value, or a pointer to the symbol-table entry.
If the symbols are given in the standard format, the LA accepts them and produces tokens as output. Each token is a sub-string of the program that is to be treated as a single unit. Tokens are of two types:
1. Specific strings such as IF or semicolon.
2. Classes of strings such as identifiers, labels, constants.
CS6660 COMPILER DESIGN UNIT II
UNIT II - LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A
program or function which performs lexical analysis is called a lexical analyzer or scanner. A lexer often
exists as a single function which is called by a parser or another function.
THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

source program → lexical analyser → token / get next token ↔ parser
(both the lexical analyser and the parser consult the symbol table)

Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
ISSUES OF LEXICAL ANALYZER
There are three issues in lexical analysis:
To make the design simpler.
To improve the efficiency of the compiler.
To enhance the computer portability.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of characters is called tokenization. A token can look like anything that is useful for processing an input text stream or text file. Consider this expression in the C programming language:
sum = 3 + 2;

Lexeme    Token type
sum       Identifier
=         Assignment operator
3         Number
+         Addition operator
2         Number
;         End of statement
Similarly, in the expression A = B + C, the lexemes are A, =, B, + and C.
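A tokenizer producing exactly the lexeme/token-type pairs of the table above can be sketched with regular expressions; the `TOKEN_SPEC` names mirror the table and are not from any particular lexer generator:

```python
import re

# Token categories for the tiny expression language above.
TOKEN_SPEC = [
    ("Identifier",          r"[A-Za-z_][A-Za-z0-9_]*"),
    ("Number",              r"[0-9]+"),
    ("Assignment operator", r"="),
    ("Addition operator",   r"\+"),
    ("End of statement",    r";"),
    ("skip",                r"\s+"),           # whitespace is discarded
]

def tokenize(text):
    # Combine the patterns into one alternation of named groups g0..g5;
    # the matching group tells us which category fired.
    pattern = "|".join(f"(?P<g{i}>{rx})" for i, (_, rx) in enumerate(TOKEN_SPEC))
    for m in re.finditer(pattern, text):
        name = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if name != "skip":
            yield (m.group(), name)

tokens = list(tokenize("sum = 3 + 2;"))
```

Running it on `sum = 3 + 2;` reproduces the six lexeme/token-type rows of the table.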
LEXEME:
A collection or group of characters forming a token is called a lexeme.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
Attributes for Tokens
Some tokens have attributes that can be passed back to the parser. The lexical analyzer collects information about tokens into their associated attributes. The attributes influence the translation of tokens.
i) Constant: value of the constant.
ii) Identifiers: pointer to the corresponding symbol table entry.
ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:
The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: deletion of successive characters from the token until the error is resolved.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. As characters are read from left to right, each character is stored in the buffer to form a meaningful token, as shown below:

(figure: a buffer with a pointer to the beginning of the token and a forward, or lookahead, pointer)

We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
BUFFER PAIRS
A buffer is divided into two N-character halves, as shown below:

: : E : : = : : M : *  |  C : * : : * : 2 : eof
          ↑                      ↑
   lexeme_beginning           forward

Each buffer is of the same size N, and N is usually the number of characters on one disk block, e.g., 1024 or 4096 bytes. Using one system read command we can read N characters into a buffer. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme_beginning marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end. The string of characters between the two pointers is the current lexeme. After the lexeme is recorded as an attribute value of a token returned to the parser, lexeme_beginning is set to the character immediately after the lexeme just found.
Advancing forward pointer:
Advancing the forward pointer requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer. If the end of the second buffer is reached, we must again reload the first buffer with input, and the pointer wraps to the beginning of the buffer.
Code to advance forward pointer:
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one to determine what character is read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
The sentinel arrangement is as shown below:

: : E : : = : : M : * : eof  |  C : * : : * : 2 : eof : : : eof
            ↑                          ↑
     lexeme_beginning               forward

Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward := forward + 1;
if forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
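The two-buffer-with-sentinel idea can be sketched in Python; the `TwoBufferScanner` class, the tiny buffer size, and the `"\0"` stand-in for eof are all assumptions of this sketch:

```python
EOF = "\0"   # stand-in for the eof sentinel; not a real source character

class TwoBufferScanner:
    """Sketch of the buffer scheme: each N-char load ends in a sentinel,
    so the common case needs only one comparison per character."""
    def __init__(self, text, n=4):
        self.chunks = [text[i:i + n] for i in range(0, len(text), n)] or [""]
        self.buf = self.chunks.pop(0) + EOF
        self.i = 0

    def next_char(self):
        ch = self.buf[self.i]
        self.i += 1
        if ch != EOF:                 # fast path: a single test
            return ch
        if self.chunks:               # sentinel at end of a load: reload
            self.buf = self.chunks.pop(0) + EOF
            self.i = 0
            return self.next_char()
        return None                   # eof mid-buffer: real end of input

scanner = TwoBufferScanner("sum=3+2;", n=4)
chars = []
while (c := scanner.next_char()) is not None:
    chars.append(c)
```

The slow path (reload-or-terminate) runs only when the sentinel is seen, mirroring the pseudocode above.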
SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression
Strings and Languages
An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string". The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not-necessarily-consecutive positions of s. For example, baan is a subsequence of banana.
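These definitions are easy to check mechanically; the helper functions below are illustrative:

```python
def prefixes(s):
    # All strings obtained by removing zero or more symbols from the end.
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    # All strings obtained by removing zero or more symbols from the front.
    return {s[i:] for i in range(len(s) + 1)}

def is_subsequence(t, s):
    # t is a subsequence of s if its characters appear in s in order.
    it = iter(s)
    return all(ch in it for ch in t)

ok = ("ban" in prefixes("banana"),
      "nana" in suffixes("banana"),
      is_subsequence("baan", "banana"))
```

All three examples from the text (ban, nana, baan against banana) check out as True.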
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages:
Let L = {0,1} and S = {a,b,c}
1. Union: L U S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, ….}
4. Positive closure: L+ = {0, 1, 00, ….}
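The four operations can be reproduced on small finite sets (Kleene closure is infinite, so this sketch enumerates only strings built from at most two pieces):

```python
from itertools import product

def union(l1, l2):
    return l1 | l2

def concat(l1, l2):
    # Every string of l1 followed by every string of l2.
    return {a + b for a, b in product(l1, l2)}

def kleene(lang, up_to=2):
    # L* is infinite; enumerate strings built from at most `up_to` pieces.
    result, layer = {""}, {""}
    for _ in range(up_to):
        layer = concat(layer, lang)
        result |= layer
    return result

L, S = {"0", "1"}, {"a", "b", "c"}
u = union(L, S)
c = concat(L, S)
star = kleene(L, up_to=2)
```

The outputs match the worked example: u is {0,1,a,b,c}, c is {0a,1a,0b,1b,0c,1c}, and star begins ε, 0, 1, 00, ….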
Regular Expressions
Each regular expression r denotes a language L(r). Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r|s = s|r is commutative; r|(s|t) = (r|s)|t is associative.
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ U {d1, d2, . . . , di-1}.
Example:
Identifiers is the set of strings of letters and digits beginning with a letter. Regular definition for this set:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter (letter | digit)*
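As a quick check, this regular definition expands to the single pattern [A-Za-z]([A-Za-z]|[0-9])*, which can be tested with Python's `re` module (the variable names mirror the definition and are ours):

```python
import re

letter = "[A-Za-z]"
digit = "[0-9]"
ident = re.compile(f"{letter}({letter}|{digit})*")   # id → letter (letter | digit)*

assert ident.fullmatch("count1")
assert ident.fullmatch("x")
assert not ident.fullmatch("1count")   # must begin with a letter
assert not ident.fullmatch("_x")       # underscore is not part of this definition
print("identifier checks passed")
```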
Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances (+):
- The unary postfix operator + means one or more instances of.
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.
As an example, the tokens if, then, else, relop, id and num can be defined as:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ (. digit+)? (E (+|-)? digit+)?
2. Zero or one instance (?):
- The unary postfix operator ? means zero or one instance of.
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) U {ε}.
3. Character Classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a-z] denotes the regular expression a | b | c | … | z.
- We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
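Putting the shorthands together, the pattern for unsigned numbers above, num → digit+ (. digit+)? (E (+|-)? digit+)?, can be written as a single regular expression and tested (the dot must be escaped, since it is a metacharacter in most regex tools):

```python
import re

num = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ["2", "3.14", "6E2", "1.5E-3"]:
    assert num.fullmatch(s), s
for s in [".5", "2.", "E3", "1.5E"]:      # each one is missing a required part
    assert not num.fullmatch(s), s
print("number checks passed")
```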
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set. Example: The set of all strings of balanced parentheses and repeating strings cannot be described by a regular expression. This set can be specified by a context-free grammar.
RECOGNITION OF TOKENS
Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the regular definitions shown earlier.
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers.
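A minimal lexical analyzer for this fragment can be sketched in Python. It tries the token patterns in priority order (reserved keywords before id, two-character relops before one-character ones) and skips whitespace; the token names follow the grammar, and everything else here is our own scaffolding:

```python
import re

# Order matters: keywords are checked before id, and two-character
# relational operators before one-character ones.
TOKEN_SPEC = [
    ("ws",    r"[ \t\n]+"),
    ("if",    r"if\b"),
    ("then",  r"then\b"),
    ("else",  r"else\b"),
    ("relop", r"<=|<>|>=|<|=|>"),
    ("num",   r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"illegal character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":               # discard whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("if x1 <= 4.5 then y else z"))
```

Note that `if\b` fails on an identifier such as ifx, so keywords are only recognized when they stand alone, matching the reserved-word assumption above.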
Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a lexical analyzer is called by the parser to get the next token. It is used to keep track of information about the characters that are seen as the forward pointer scans the input.
A LANGUAGE FOR SPECIFYING LEXICAL ANALYZER
There is a wide range of tools for constructing lexical analyzers:
Lex
YACC

LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of variables, constants, and regular definitions.
Rules are statements of the form
p1 {action 1}
p2 {action 2}
…
pn {action n}
where pi is a regular expression and action i describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.
YACC- YET ANOTHER COMPILER-COMPILER
Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine.
FINITE AUTOMATA
Finite Automata is one of the mathematical models that consist of a number of states and edges. It is a transition diagram that recognizes a regular expression or grammar.
Types of Finite Automata
There are two types of Finite Automata:
Non-deterministic Finite Automata (NFA)
Deterministic Finite Automata (DFA)
Non-deterministic Finite Automata
NFA is a mathematical model that consists of five tuples denoted by M = {Qn, Ʃ, δ, q0, fn}
Qn : finite set of states
Ʃ : finite set of input symbols
δ : transition function that maps state-symbol pairs to sets of states
q0 : starting state
fn : final state
Deterministic Finite Automata
DFA is a special case of an NFA in which
i) no state has an ε-transition.
ii) there is at most one transition from each state on any input.
DFA has five tuples denoted by M = {Qd, Ʃ, δ, q0, fd}
Qd : finite set of states
Ʃ : finite set of input symbols
δ : transition function that maps each state-symbol pair to at most one state
q0 : starting state
fd : final state
Converting a Regular Expression into a Deterministic Finite Automaton
The task of a scanner generator, such as flex, is to generate the transition tables or to
synthesize the scanner program given a scanner specification (in the form of a set of REs).
So it needs to convert a RE into a DFA. This is accomplished in two steps: first it converts
a RE into a non-deterministic finite automaton (NFA) and then it converts the NFA into a
DFA.
An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε. The first type indicates that, when reading the common character associated with these transitions, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The ε-transition doesn't consume any input characters, so you may jump to another state for free.
Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same
expressive power. The problem is that when converting a NFA to a DFA we may get an
exponential blowup in the number of states.
We will first learn how to convert a RE into a NFA. This is the easy part. There are only 5
rules, one for each type of RE:
The algorithm constructs NFAs with only one final state. For example, the third rule
indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and B
which are represented as two boxes with one start and one final state for each box. Then
the NFA for AB is constructed by connecting the final state of A to the start state of B using
an empty transition.
For example, the RE (a|b)c is mapped to the following NFA:
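The construction can be sketched in Python. Each helper below implements one of the RE-to-NFA rules and returns an NFA as a (start, accept, transitions) triple, where transitions maps (state, symbol) pairs to sets of states and the symbol None stands for ε. This encoding, and the small simulator used to check it, are our own; the rule for the RE ε alone would just be a single ε-transition and is omitted.

```python
import itertools

counter = itertools.count()                  # fresh state numbers

def add(tr, s, sym, t):
    tr.setdefault((s, sym), set()).add(t)    # sym=None means ε

def symbol(c):                               # rule: a single symbol a
    s, f = next(counter), next(counter)
    tr = {}
    add(tr, s, c, f)
    return s, f, tr

def union(a, b):                             # rule: r | s
    s, f = next(counter), next(counter)
    tr = {**a[2], **b[2]}
    for n in (a, b):
        add(tr, s, None, n[0])               # ε to each alternative's start
        add(tr, n[1], None, f)               # ε from each accept to new accept
    return s, f, tr

def concat(a, b):                            # rule: rs
    tr = {**a[2], **b[2]}
    add(tr, a[1], None, b[0])                # ε from accept of a to start of b
    return a[0], b[1], tr

def star(a):                                 # rule: r*
    s, f = next(counter), next(counter)
    tr = dict(a[2])
    add(tr, s, None, a[0]); add(tr, s, None, f)
    add(tr, a[1], None, a[0]); add(tr, a[1], None, f)
    return s, f, tr

def eps_closure(states, tr):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for t in tr.get((q, None), ()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return seen

def accepts(nfa, text):
    start, final, tr = nfa
    current = eps_closure({start}, tr)
    for ch in text:
        moved = set().union(*(tr.get((q, ch), set()) for q in current))
        current = eps_closure(moved, tr)
    return final in current

# The RE (a|b)c from the example:
nfa = concat(union(symbol("a"), symbol("b")), symbol("c"))
print(accepts(nfa, "ac"), accepts(nfa, "bc"), accepts(nfa, "ab"))  # True True False
```

Running it on the example confirms that (a|b)c accepts exactly the strings ac and bc.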
The next step is to convert a NFA to a DFA (called subset construction). Suppose that you
assign a number to each NFA state. The DFA states generated by subset construction have
sets of numbers, instead of just one number. For example, a DFA state may have been
assigned the set {5,6,8}. This indicates that arriving to the state labeled {5,6,8} in the DFA
is the same as arriving to the state 5, the state 6, or the state 8 in the NFA when parsing the
same input. (Recall that a particular input sequence when parsed by a DFA, leads to a
unique state, while when parsed by a NFA it may lead to multiple states.)
First we need to handle transitions that lead to other states for free (without consuming any input). These are the ε-transitions. We define the ε-closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more ε-transitions. For example, the ε-closure of node 1 in the left figure below is the set {1,2}. The start state of the constructed DFA is labeled by the ε-closure of the NFA
start state. For every DFA state labeled by some set {s1, s2, ..., sn} and for every character c in the language alphabet, you find all the states reachable from s1, s2, ..., or sn using c arrows, and you union together the ε-closures of these nodes. If this set is not the label of any other
node in the DFA constructed so far, you create a new DFA node with this label. For
example, node {1,2} in the DFA above has an arrow to a {3,4,5} for the character a since
the NFA node 3 can be reached by 1 on a and nodes 4 and 5 can be reached by 2. The b
arrow for node {1,2} goes to the error node which is associated with an empty set of NFA
nodes. The following NFA recognizes , even though it wasn't constructed
with the 5 RE-to-NFA rules. It has the following DFA:
Converting NFAs to DFAs
To convert an NFA to a DFA, we must find a way to remove all ε-transitions and to ensure that there is one transition per symbol in each state. We do this by constructing a DFA in which each state corresponds to a set of some states from the NFA. In the DFA, a transition from a state S by some symbol goes to the state S' that consists of all the possible NFA states that could be reached by that symbol from some NFA state q contained in the present DFA state S. The resulting DFA "simulates" the given NFA in the sense that a single DFA transition represents many simultaneous NFA transitions. The first concept we need is the ε-closure (pronounced "epsilon closure"). The ε-closure of an NFA state q is the set containing q along with all states in the automaton that are reachable by any number of ε-transitions from q. In the following automaton, the ε-closures are given in the table to the right:
Likewise, we can define the ε-closure of a set of states to be the states reachable by ε-transitions from its members. In other words, this is the union of the ε-closures of its elements. To convert our NFA to its DFA counterpart, we begin by taking the ε-closure of the start state q0 of our NFA and constructing a new start state S0 in our DFA corresponding to that ε-closure. Next, for each symbol in our alphabet, we record the set of NFA states that we can reach from S0 on that symbol. For each such set, we make a DFA state corresponding to its ε-closure, taking care to do this only once for each set. In the case two sets are equal, we simply reuse the existing DFA state that we already constructed. This process is then repeated for each of the new DFA states (that is, sets of NFA states) until we run out of DFA states to process. Finally, every DFA state whose corresponding set of NFA states contains an accepting state is itself marked as an accepting state.
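The whole procedure fits in a short Python sketch. The NFA is given as a dict mapping (state, symbol) to a set of states, with None for ε; each DFA state is the frozenset of NFA states it stands for. The demo automaton is the classic eleven-state NFA for (a|b)*abb, which is our own choice of example, not one pictured in these notes:

```python
def eps_closure(states, trans):
    """All NFA states reachable from `states` using only ε (None) moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(trans, start, accepting, alphabet):
    """Subset construction: DFA states are frozensets of NFA states."""
    start_state = eps_closure({start}, trans)
    dfa, worklist = {}, [start_state]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for c in alphabet:
            moved = set().union(*(trans.get((q, c), set()) for q in S))
            T = eps_closure(moved, trans)
            dfa[S][c] = T
            if T not in dfa:
                worklist.append(T)
    dfa_accepting = {S for S in dfa if S & accepting}
    return dfa, start_state, dfa_accepting

# NFA for (a|b)*abb, states 0..10; None marks ε-transitions.
trans = {
    (0, None): {1, 7}, (1, None): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, None): {6}, (5, None): {6}, (6, None): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
dfa, s0, acc = nfa_to_dfa(trans, 0, {10}, "ab")
print(len(dfa))   # 5
```

Subset construction yields the familiar five-state DFA for this regular expression.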
2.17 Lex specifications:
A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form :
p1
{action 1}
p2
{action 2}
p3
{action 3}
where each p is a regular expression and each action is a program fragment
describing what action the lexical analyzer should take when a pattern p matches a
lexeme. In Lex the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively these procedures can be compiled separately and loaded with the
lexical analyzer.
Department of CSE UNIT III
UNIT -3
SYNTAX ANALYSIS
3.1 ROLE OF THE PARSER
Parser obtains a string of tokens from the lexical analyzer and verifies that it can be generated
by the language for the source program. The parser should report any syntax errors in an
intelligible fashion. The two types of parsers employed are:
1. Top-down parser: builds parse trees from the top (root) to the bottom (leaves).
2. Bottom-up parser: builds parse trees from the leaves and works up to the root.
Therefore there are two types of parsing methods: top-down parsing and bottom-up parsing.
3.2 TOP-DOWN PARSING
A program that performs syntax analysis is called a parser. A syntax analyzer takes tokens as input and outputs error messages if the program syntax is wrong. The parser uses symbol lookahead and an approach called top-down parsing without backtracking. Top-down parsers check to see if a string can be generated by a grammar by creating a parse tree starting from the initial symbol and working down. Bottom-up parsers, however, check to see if a string can be generated from a grammar by creating a parse tree from the leaves, and working up. Early parser generators such as YACC create bottom-up parsers whereas many of the Java parser generators such as JavaCC create top-down parsers.
3.3RECURSIVE DESCENT PARSING
Typically, top-down parsers are implemented as a set of recursive functions that descend through a parse tree for a string. This approach is known as recursive descent parsing, also known as LL(k) parsing, where the first L stands for left-to-right, the second L stands for leftmost derivation, and k indicates k-symbol lookahead. Therefore, a parser using the single-symbol lookahead method and top-down parsing without backtracking is called an LL(1) parser. In the following sections, we will also use an extended BNF notation in which some regular expression operators are to be incorporated.
A syntax expression defines sentences of the form , or . A syntax of the form defines
sentences that consist of a sentence of the form followed by a sentence of the form followed
by a sentence of the form . A syntax of the form defines zero or one occurrence of the form .
A syntax of the form defines zero or more occurrences of the form .
A usual implementation of an LL(1) parser is:
initialize its data structures,
get the lookahead token by calling scanner routines, and
call the routine that implements the start symbol.
Here is an example.
proc syntaxAnalysis()
begin
initialize(); // initialize global data and structures
nextToken(); // get the lookahead token
program(); // parser routine that implements the start symbol
end;
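Filling in that skeleton for the expression grammar used in the FIRST/FOLLOW example later in this unit (E -> TE', E' -> +TE' | e, T -> FT', T' -> *FT' | e, F -> (E) | id) gives a small LL(1) recursive-descent parser. The single-symbol lookahead is `self.look()`; all class and function names are illustrative:

```python
class Parser:
    """LL(1) recursive-descent parser for:
    E -> T E'    E' -> + T E' | e    T -> F T'    T' -> * F T' | e    F -> ( E ) | id"""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]      # $ marks the end of input
        self.pos = 0

    def look(self):                       # one-symbol lookahead
        return self.tokens[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t!r}, got {self.look()!r}")
        self.pos += 1

    def E(self):                          # E -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                     # E' -> + T E' | e
        if self.look() == "+":
            self.match("+"); self.T(); self.Eprime()
        # otherwise take the e-production; lookahead must be in FOLLOW(E')

    def T(self):                          # T -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                     # T' -> * F T' | e
        if self.look() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):                          # F -> ( E ) | id
        if self.look() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

def parse(tokens):
    p = Parser(tokens)
    p.E()
    p.match("$")                          # the whole input must be consumed
    return True

print(parse(["id", "+", "id", "*", "(", "id", ")"]))   # True
```

Each nonterminal becomes one function, and the choice of production is made by inspecting the lookahead token alone, which is exactly the LL(1) property.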
3.4 FIRST AND FOLLOW
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or e can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}.
2. If X -> e is a production, then add e to FIRST(X).
3. If X is a nonterminal and X -> Y1Y2...Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi), and e is in all of FIRST(Y1),...,FIRST(Yi-1); that is, Y1...Yi-1 =*> e. If e is in FIRST(Yj) for all j = 1,2,...,k, then add e to FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive e, then we add nothing more to FIRST(X), but if Y1 =*> e, then we add FIRST(Y2), and so on.
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A -> aBs, then everything in FIRST(s) except e is placed in FOLLOW(B).
3. If there is a production A -> aB, or a production A -> aBs where FIRST(s) contains e, then everything in FOLLOW(A) is in FOLLOW(B).
Consider the following example to understand the concept of FIRST and FOLLOW. Find the FIRST and FOLLOW of all nonterminals in the grammar:
E -> TE'
E'-> +TE'|e
T -> FT'
T'-> *FT'|e
F -> (E)|id
Then:
FIRST(E)=FIRST(T)=FIRST(F)={(,id}
FIRST(E')={+,e}
FIRST(T')={*,e}
FOLLOW(E)=FOLLOW(E')={),$}
FOLLOW(T)=FOLLOW(T')={+,),$}
FOLLOW(F)={+,*,),$}
For example, id and left parenthesis are added to FIRST(F) by rule 3 in the definition of FIRST with i = 1 in each case, since FIRST(id) = {id} and FIRST('(') = {(} by rule 1. Then by rule 3 with i = 1, the production T -> FT' implies that id and left parenthesis belong to FIRST(T) also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied to production F -> (E), right parenthesis is also in FOLLOW(E). By rule 3 applied to production E -> TE', $ and right parenthesis are in FOLLOW(E').
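Both rule sets can be run as one fixed-point computation in Python over the example grammar; 'e' stands for ε as in the text, an empty production body encodes an ε-production, and the dict encoding is ours:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] encodes the ε-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)
EPS = "e"

def first_of_seq(seq, first):
    """FIRST of a string of grammar symbols; EPS marks ε."""
    out = set()
    for X in seq:
        f = first[X] if X in NONTERMS else {X}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                          # every symbol of seq can derive ε
    return out

def compute_first_follow(start):
    first = {A: set() for A in NONTERMS}
    follow = {A: set() for A in NONTERMS}
    follow[start].add("$")                # rule 1 for FOLLOW
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for prod in prods:
                f = first_of_seq(prod, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
                for i, X in enumerate(prod):
                    if X not in NONTERMS:
                        continue
                    rest = first_of_seq(prod[i + 1:], first)
                    new = rest - {EPS}                   # rule 2 for FOLLOW
                    if EPS in rest:
                        new |= follow[A]                 # rule 3 for FOLLOW
                    if not new <= follow[X]:
                        follow[X] |= new
                        changed = True
    return first, follow

first, follow = compute_first_follow("E")
print(sorted(first["E"]), sorted(follow["E'"]))   # ['(', 'id'] ['$', ')']
```

The computed sets agree with the FIRST and FOLLOW sets listed above for this grammar.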
3.5 CONSTRUCTION OF PREDICTIVE PARSING TABLES
For any grammar G, the following algorithm can be used to construct the predictive parsing
table. The algorithm is
Input : Grammar G
Output : Parsing table M
Method
1. For each production A -> a of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(a), add A -> a to M[A,a].
3. If e is in FIRST(a), add A -> a to M[A,b] for each terminal b in FOLLOW(A). If e is in FIRST(a) and $ is in FOLLOW(A), add A -> a to M[A,$].
4. Make each undefined entry of M be error.
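The algorithm is a direct loop over the productions. The sketch below hard-codes the FIRST and FOLLOW sets computed earlier for the expression grammar rather than recomputing them, and leaves undefined entries of M to mean error, per step 4 (the encoding is ours):

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] encodes the ε-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", "e"}, "T": {"(", "id"},
         "T'": {"*", "e"}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of(alpha):
    """FIRST of a production body; 'e' marks ε."""
    out = set()
    for X in alpha:
        f = FIRST[X] if X in NONTERMS else {X}
        out |= f - {"e"}
        if "e" not in f:
            return out
    return out | {"e"}

def build_table():
    M = {}
    for A, prods in GRAMMAR.items():
        for prod in prods:
            f = first_of(prod)
            for a in f - {"e"}:                  # step 2
                M[A, a] = prod
            if "e" in f:                         # step 3
                for b in FOLLOW[A]:
                    M[A, b] = prod
    return M          # entries absent from M mean "error" (step 4)

M = build_table()
print(M["E'", "+"], M["E'", ")"])   # ['+', 'T', "E'"] []
```

Because this grammar is LL(1), no entry of M is ever assigned twice; a multiply-defined entry here would signal that the grammar is not LL(1).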
3.6.LL(1) GRAMMAR
The above algorithm can be applied to any grammar G to produce a parsing table M. For
some Grammars, for example if G is left recursive or ambiguous, then M will have at least
one multiply-defined entry. A grammar whose parsing table has no multiply defined entries
is said to be LL(1). It can be shown that the above algorithm can be used to produce for every
LL(1) grammar G a parsing table M that parses all and only the sentences of G. LL(1)
grammars have several distinctive properties. No ambiguous or left recursive grammar can
be LL(1). There remains a question of what should be done in case of multiply defined
entries. One easy solution is to eliminate all left recursion and left factoring, hoping to
produce a grammar which will produce no multiply defined entries in the parse tables.
Unfortunately there are some grammars which will not yield an LL(1) grammar after any kind of alteration. In general, there are no universal rules to convert multiply defined entries into
single valued entries without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left
recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes. To alleviate some of this difficulty, a
common organization for a parser in a compiler is to use a predictive parser for control
constructs and to use operator precedence for expressions. However, if an LR parser generator
is available, one can get all the benefits of predictive parsing and operator precedence
automatically.
3.7.ERROR RECOVERY IN PREDICTIVE PARSING
The stack of a nonrecursive predictive parser makes explicit the terminals and nonterminals
that the parser hopes to match with the remainder of the input. We shall therefore refer to
symbols on the parser stack in the following discussion. An error is detected during
predictive parsing when the terminal on top of the stack does not match the next input
symbol or when nonterminal A is on top of the stack, a is the next input symbol, and the
parsing table entry M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input until a token
in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of
synchronizing set. The sets should be chosen so that the parser recovers quickly from errors
that are likely to occur in practice. Some heuristics are as follows
As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing
set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and
pop A from the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, as in C, then keywords that begin statements may not appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement being skipped. Often, there is a hierarchical structure on constructs in a language; e.g., expressions appear within statements, which appear within blocks, and so on. We can add to the synchronizing set of a lower construct the symbols that begin higher constructs. For example, we might add keywords that begin statements to the synchronizing sets for the nonterminals generating expressions.
If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST(A) appears in
the input.
If a nonterminal can generate the empty string, then the production deriving e can be
used as a default. Doing so may postpone some error detection, but cannot cause an
error to be missed. This approach reduces the number of nonterminals that have to be
considered during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue parsing.
In effect, this approach takes the synchronizing set of a token to consist of all other
tokens.
3.8 LR PARSING INTRODUCTION
The "L" is for left-to-right scanning of the input and the "R" is for constructing a rightmost
derivation in reverse.
WHY LR PARSING:
LR parsers can be constructed to recognize virtually all programming-language
constructs for which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
The disadvantage is that it takes too much work to construct an LR parser by hand for
a typical programming-language grammar. But there are lots of LR parser generators
available to make this task easy.
MODELS OF LR PARSERS
The schematic form of an LR parser is shown below.
The program uses a stack to store a string of the form s0X1s1X2...Xmsm, where sm is on top. Each Xi is a grammar symbol and each si is a symbol representing a state. Each state symbol summarizes the information contained in the stack below it. The combination of the state symbol on top of the stack and the current input symbol are used to index the parsing table and determine the shift-reduce parsing decision. The parsing table consists of two parts: a parsing action function action and a goto function goto. The program driving the LR parser behaves as follows: It determines sm, the state currently on top of the stack, and ai, the current input symbol. It then consults action[sm,ai], which can have one of four values:
shift s, where s is a state
reduce by a grammar production A -> b
accept
error
The function goto takes a state and grammar symbol as arguments and produces a state.
For a parsing table constructed for a grammar G, the goto table is the transition function of a deterministic finite automaton that recognizes the viable prefixes of G. Recall that the viable prefixes of G are those prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser because they do not extend past the rightmost handle.
A configuration of an LR parser is a pair whose first component is the stack contents and
whose second component is the unexpended input:
(s0 X1 s1 X2 s2... Xm sm, ai ai+1... an$)
This configuration represents the right-sentential form
X1 X2 ... Xm ai ai+1 ... an
in essentially the same way a shift-reduce parser would; only the presence of the states on the
stack is new. Recall the sample parse we did (see Example 1: Sample bottom-up parse) in
which we assembled the right-sentential form by concatenating the remainder of the input
buffer to the top of the stack. The next move of the parser is determined by reading ai and
sm, and consulting the parsing action table entry action[sm, ai]. Note that we are just looking
at the state here and no symbol below it. We'll see how this actually works later.
The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move entering the configuration
(s0 X1 s1 X2 s2... Xm sm ai s, ai+1... an$)
Here the parser has shifted both the current input symbol ai and the next state s, given by action[sm, ai], onto the stack; ai+1 becomes the current input symbol.
If action[sm, ai] = reduce A -> b, then the parser executes a reduce move, entering the
configuration,
(s0 X1 s1 X2 s2... Xm-r sm-r A s, ai ai+1... an$)
where s = goto[sm-r, A] and r is the length of b, the right side of the production. The parser
first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state
sm-r. The parser then pushed both A, the left side of the production, and s, the entry for
goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.
The output of an LR parser is generated after a reduce move by executing the semantic action
associated with the reducing production. For example, we might just print out the production
reduced.
If action[sm, ai] = accept, parsing is completed.
3.9 SHIFT REDUCE PARSING
A shift-reduce parser uses a parse stack which (conceptually) contains grammar symbols.
During the operation of the parser, symbols from the input are shifted onto the stack. If a
prefix of the symbols on top of the stack matches the RHS of a grammar rule which is the
correct rule to use within the current context, then the parser reduces the RHS of the rule to
its LHS, replacing the RHS symbols on top of the stack with the nonterminal occurring on the
LHS of the rule. This shift-reduce process continues until the parser terminates, reporting
either success or failure. It terminates with success when the input is legal and is accepted by
the parser. It terminates with failure if an error is detected in the input. The parser is nothing
but a stack automaton which may be in one of several discrete states. A state is usually
represented simply as an integer. In reality, the parse stack contains states, rather than
grammar symbols. However, since each state corresponds to a unique grammar symbol, the
state stack can be mapped onto the grammar symbol stack mentioned earlier.
The operation of the parser is controlled by a couple of tables:
ACTION TABLE
The action table is a table with rows indexed by states and columns indexed by terminal
symbols. When the parser is in some state s and the current lookahead terminal is t, the
action taken by the parser depends on the contents of action[s][t], which can contain four
different kinds of entries:
Shift s'
Shift state s' onto the parse stack.
Reduce r
Reduce by rule r. This is explained in more detail below.
Accept
Terminate the parse with success, accepting the input.
Error
Signal a parse error
GOTO TABLE
The goto table is a table with rows indexed by states and columns indexed by nonterminal
symbols. When the parser is in state s immediately after reducing by rule N, then the next
state to enter is given by goto[s][N].
The current state of a shift-reduce parser is the state on top of the state stack. The detailed
operation of such a parser is as follows:
1. Initialize the parse stack to contain a single state s0, where s0 is the distinguished initial
state of the parser.
2. Use the state s on top of the parse stack and the current lookahead t to consult the action
table entry action[s][t]:
· If the action table entry is shift s' then push state s' onto the stack and advance the
input so that the lookahead is set to the next token.
· If the action table entry is reduce r and rule r has m symbols in its RHS, then pop
m symbols off the parse stack. Let s' be the state now revealed on top of the parse
stack and N be the LHS nonterminal for rule r. Then consult the goto table and
push the state given by goto[s'][N] onto the stack. The lookahead token is not
changed by this step.
If the action table entry is accept, then terminate the parse with success.
If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
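The loop above can be sketched as a table-driven Python function. To keep the example self-contained we use a tiny hand-built SLR table for the two-rule grammar S -> ( S ) | x rather than reproducing the assignment-statement tables from the figure; the state numbering and all names are our own:

```python
# Grammar: 1) S -> ( S )   2) S -> x
PRODS = {1: ("S", 3), 2: ("S", 1)}       # rule -> (LHS, length of RHS)

ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}

def parse(tokens):
    stack = [0]                          # state stack; s0 on the bottom
    tokens = tokens + ["$"]
    i = 0
    while True:
        s, t = stack[-1], tokens[i]
        kind, arg = ACTION.get((s, t), ("error", None))
        if kind == "shift":
            stack.append(arg)            # push state s' and advance the input
            i += 1
        elif kind == "reduce":
            lhs, n = PRODS[arg]
            del stack[-n:]               # pop the RHS's states
            stack.append(GOTO[stack[-1], lhs])
            print("reduce by rule", arg)
        elif kind == "accept":
            return True
        else:
            raise SyntaxError(f"parse error at token {t!r}")

print(parse(["(", "(", "x", ")", ")"]))
```

Blank (missing) ACTION entries play the role of the error entries in the table, exactly as in step 2 above.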
For example, consider the following simple grammar
0) $S: stmt <EOF>
1) stmt: ID ':=' expr
2) expr: expr '+' ID
3) expr: expr '-' ID
4) expr: ID
which describes assignment statements like a:= b + c - d. (Rule 0 is a special augmenting
production added to the grammar).
One possible set of shift-reduce parsing tables is shown below (sn denotes shift n, rn denotes
reduce n, acc denotes accept and blank entries denote error entries):
Parser Tables
3.10 SLR PARSER
An LR(0) item (or just item) of a grammar G is a production of G with a dot at some position
of the right side indicating how much of a production we have seen up to a given point.
For example, for the production E -> E + T we would have the following items:
[E -> .E + T]
[E -> E. + T]
[E -> E +. T]
[E -> E + T.]
CONSTRUCTING THE SLR PARSING TABLE
To construct the parser table we must convert our NFA into a DFA. The states in the LR table will be the ε-closures of the states corresponding to the items, so the process of creating the LR state table parallels the process of constructing an equivalent DFA from a machine with ε-transitions; this is essentially the subset construction algorithm, so we are in familiar territory here.
We need two operations: closure() and goto().
closure()
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by two rules:
1) Initially, every item in I is added to closure(I).
2) If A -> α.Bβ is in closure(I), and B -> γ is a production, then add the item [B -> .γ] to closure(I), if it is not already there. Apply this rule until no more new items can be added to closure(I).
From our grammar above, if I is the set of one item {[E'-> .E]}, then closure(I) contains:
I0: E' -> .E
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
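The two rules translate directly into code. The sketch below uses the expression grammar of the running example and represents an LR(0) item as a (head, rhs, dot) triple:

```python
# Grammar of the running example: E' -> E, E -> E+T | T, T -> T*F | F, F -> (E) | id
GRAMMAR = {
    "E'": [('E',)],
    'E': [('E', '+', 'T'), ('T',)],
    'T': [('T', '*', 'F'), ('F',)],
    'F': [('(', 'E', ')'), ('id',)],
}

def closure(items):
    """items: set of (head, rhs, dot) LR(0) items."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot just before a nonterminal B
                for prod in GRAMMAR[rhs[dot]]:            # add the initial item [B -> .gamma]
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

I0 = closure({("E'", ('E',), 0)})
print(len(I0))  # 7 items, matching the list above
```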
goto()
goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure of the set of all items [A -> αX.β] such that [A -> α.Xβ] is in I. The idea here is fairly intuitive: if I is the set of items that are valid for some viable prefix γ, then goto(I, X) is the set of items that are valid for the viable prefix γX.
SETS-OF-ITEMS-CONSTRUCTION
To construct the canonical collection of sets of LR(0) items for
augmented grammar G'.
procedure items(G')
begin
C := {closure({[S' -> .S]})};
repeat
for each set of items in C and each grammar symbol X
such that goto(I, X) is not empty and not in C do
add goto(I, X) to C;
until no more sets of items can be added to C
end;
ALGORITHM FOR CONSTRUCTING AN SLR PARSING TABLE
Input: augmented grammar G'
Output: SLR parsing table functions action and goto for G'
Method:
Construct C = {I0, I1 , ..., In} the collection of sets of LR(0) items for G'.
State i is constructed from Ii:
If [A -> α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a terminal.
If [A -> α.] is in Ii, then set action[i, a] to "reduce A -> α" for all a in FOLLOW(A). Here A may not be S'.
If [S' -> S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by these rules, the grammar is not SLR(1) and the
algorithm fails to produce a parser. The goto transitions for state i are constructed for all
nonterminals A using the rule: If goto(Ii, A)= Ij, then goto[i, A] = j.
All entries not defined by rules 2 and 3 are made "error".
The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
Let's work an example to get a feel for what is going on,
An Example
(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1
The Action and Goto Tables: the two LR(0) parsing tables for this grammar look as follows.
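The tables can be reconstructed mechanically by running the sets-of-items construction and the SLR rules above on this grammar. The sketch below omits conflict detection for brevity, and its state numbering falls out of discovery order, so it need not match any particular figure:

```python
# Augmented grammar: rule 0 is the added production S -> E.
RULES = [('S', ('E',)),                      # 0 (augmenting production)
         ('E', ('E', '*', 'B')),             # 1
         ('E', ('E', '+', 'B')),             # 2
         ('E', ('B',)),                      # 3
         ('B', ('0',)),                      # 4
         ('B', ('1',))]                      # 5
NONTERMINALS = {'S', 'E', 'B'}
FOLLOW = {'S': {'$'}, 'E': {'*', '+', '$'}, 'B': {'*', '+', '$'}}  # hand-computed

def closure(items):
    result, work = set(items), list(items)
    while work:
        head, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for h, r in RULES:
                if h == rhs[dot] and (h, r, 0) not in result:
                    result.add((h, r, 0))
                    work.append((h, r, 0))
    return frozenset(result)

def goto(items, x):
    return closure({(h, r, d + 1) for h, r, d in items
                    if d < len(r) and r[d] == x})

def canonical_collection():
    start = closure({('S', ('E',), 0)})
    states, work = [start], [start]
    while work:
        i = work.pop()
        for x in {r[d] for _, r, d in i if d < len(r)}:
            j = goto(i, x)
            if j and j not in states:
                states.append(j)
                work.append(j)
    return states

def slr_table(states):
    action = {}
    for i, state in enumerate(states):
        for h, r, d in state:
            if d < len(r):                       # rule 1: shift on a terminal
                if r[d] not in NONTERMINALS:
                    action[i, r[d]] = ('s', states.index(goto(state, r[d])))
            elif h == 'S':                       # rule 3: accept
                action[i, '$'] = ('acc', None)
            else:                                # rule 2: reduce on FOLLOW(h)
                for a in FOLLOW[h]:
                    action[i, a] = ('r', RULES.index((h, r)))
    return action

states = canonical_collection()
action = slr_table(states)
print(len(states))   # 9
```

This grammar yields 9 LR(0) states; the initial state shifts on both 0 and 1, and exactly one state accepts on $.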
3.11 LALR PARSER:
We begin with two observations. First, some of the states generated for LR(1) parsing have
the same set of core (or first) components and differ only in their second component, the
lookahead symbol. Our intuition is that we should be able to merge these states and reduce
the number of states we have, getting close to the number of states that would be generated
for LR(0) parsing. This observation suggests a hybrid approach: We can construct the
canonical LR(1) sets of items and then look for sets of items having the same core. We merge
these sets with common cores into one set of items. The merging of states with common
cores can never produce a shift/reduce conflict that was not present in one of the original
states because shift actions depend only on the core, not the lookahead. But it is possible for
the merger to produce a reduce/reduce conflict.
Our second observation is that we are really only interested in the lookahead symbol in
places where there is a problem. So our next thought is to take the LR(0) set of items and add
lookaheads only where they are needed. This leads to a more efficient, but much more
complicated method.
ALGORITHM FOR EASY CONSTRUCTION OF AN LALR TABLE
Input: G'
Output: LALR parsing table functions with action and goto for G'.
Method:
1. Construct C = {I0, I1 , ..., In} the collection of sets of LR(1) items for G'.
2. For each core present among the set of LR(1) items, find all sets having that core
and replace these sets by the union.
3. Let C' = {J0, J1 , ..., Jm} be the resulting sets of LR(1) items. The parsing actions
for state i are constructed from Ji in the same manner as in the construction of the
canonical LR parsing table.
4. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.
5. The goto table is constructed as follows: If J is the union of one or more sets of LR(1) items, that is, J = I0 U I1 U ... U Ik, then the cores of goto(I0, X), goto(I1, X), ..., goto(Ik, X) are the same, since I0, I1, ..., Ik all have the same core. Let K be the union of all sets of items having the same core as goto(I1, X).
6. Then goto(J, X) = K.
Consider the usual example grammar S' -> S, S -> CC, C -> cC | d, whose LR(1) items these are.
I3 and I6 can be replaced by their union:
I36: C -> c.C, c/d/$
C -> .cC, c/d/$
C -> .d, c/d/$
I47: C -> d., c/d/$
I89: C -> cC., c/d/$
Parsing Table

state   c     d     $     S   C
0       s36   s47         1   2
1                   acc
2       s36   s47             5
36      s36   s47             89
47      r3    r3    r3
5                   r1
89      r2    r2    r2
HANDLING ERRORS
The LALR parser may continue to do reductions after the LR parser would have spotted an
error, but the LALR parser will never do a shift after the point the LR parser would have
discovered the error and will eventually find the error.
3.12 LR ERROR RECOVERY
An LR parser will detect an error when it consults the parsing action table and finds a blank or error entry. Errors are never detected by consulting the goto table. An LR parser will detect an error as soon as there is no valid continuation for the portion of the input thus far scanned.
A canonical LR parser will not make even a single reduction before announcing the error.
SLR and LALR parsers may make several reductions before detecting an error, but they will
never shift an erroneous input symbol onto the stack.
3.12.1 PANIC-MODE ERROR RECOVERY
We can implement panic-mode error recovery by scanning down the stack until a state s with a goto on a particular nonterminal A is found. Zero or more input symbols are then discarded until a symbol a is found that can legitimately follow A. The situation might exist where there is more than one choice for the nonterminal A. Normally these would be nonterminals
representing major program pieces, e.g. an expression, a statement, or a block. For example,
if A is the nonterminal stmt, a might be semicolon or }, which marks the end of a statement
sequence. This method of error recovery attempts to eliminate the phrase containing the
syntactic error. The parser determines that a string derivable from A contains an error. Part of
that string has already been processed, and the result of this processing is a sequence of states
on top of the stack. The remainder of the string is still in the input, and the parser attempts to
skip over the remainder of this string by looking for a symbol on the input that can
legitimately follow A. By removing states from the stack, skipping over the input, and pushing GOTO(s, A) on the stack, the parser pretends that it has found an instance of A and resumes normal parsing.
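A recovery routine following this scheme might look like the sketch below (the names state_stack, goto_table and recovery_nts are illustrative; a real parser tracks more bookkeeping):

```python
def panic_mode_recover(state_stack, tokens, pos, goto_table, follow, recovery_nts):
    """Scan down the stack for a state with a goto on some recovery nonterminal A,
    discard input until a symbol in FOLLOW(A), then pretend an A was parsed."""
    while state_stack:
        s = state_stack[-1]
        for A in recovery_nts:
            if (s, A) in goto_table:
                # discard input symbols until one that can legitimately follow A
                while pos < len(tokens) and tokens[pos] not in follow[A]:
                    pos += 1
                state_stack.append(goto_table[(s, A)])   # push GOTO(s, A)
                return state_stack, pos
        state_stack.pop()                                # keep scanning down the stack
    return state_stack, pos

# toy data: state 3 has a goto on 'stmt'; FOLLOW(stmt) = {';'}
stack, pos = panic_mode_recover([0, 3, 7], ['x', '+', ';', 'y'], 0,
                                {(3, 'stmt'): 4}, {'stmt': {';'}}, ['stmt'])
print(stack, pos)  # [0, 3, 4] 2
```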
3.12.2 PHRASE-LEVEL RECOVERY
Phrase-level recovery is implemented by examining each error entry in the LR action table
and deciding on the basis of language usage the most likely programmer error that would
give rise to that error. An appropriate recovery procedure can then be constructed;
presumably the top of the stack and/or first input symbol would be modified in a way deemed
appropriate for each error entry. In designing specific error-handling routines for an LR
parser, we can fill in each blank entry in the action field with a pointer to an error routine that
will take the appropriate action selected by the compiler designer.
The actions may include insertion or deletion of symbols from the stack or the input or both,
or alteration and transposition of input symbols. We must make our choices so that the LR
parser will not get into an infinite loop. A safe strategy will assure that at least one input
symbol will be removed or shifted eventually, or that the stack will eventually shrink if the
end of the input has been reached. Popping a stack state that covers a non terminal should be
avoided, because this modification eliminates from the stack a construct that has already been
successfully parsed.
UNIT IV - SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT
SEMANTIC ANALYSIS
Semantic Analysis computes additional information related to the meaning of the
program once the syntactic structure is known.
In typed languages such as C, semantic analysis involves adding information to the symbol table and performing type checking.
The information to be computed is beyond the capabilities of standard parsing
techniques, therefore it is not regarded as syntax.
As for Lexical and Syntax analysis, also for Semantic Analysis we need both a
Representation Formalism and an Implementation Mechanism.
As representation formalism this lecture illustrates what are called Syntax Directed
Translations.
SYNTAX DIRECTED TRANSLATION
The Principle of Syntax Directed Translation states that the meaning of an input
sentence is related to its syntactic structure, i.e., to its Parse-Tree.
By Syntax Directed Translations we indicate those formalisms for specifying
translations for programming language constructs guided by context-free grammars.
o We associate Attributes to the grammar symbols representing the language
constructs.
o Values for attributes are computed by Semantic Rules associated with
grammar productions.
Evaluation of Semantic Rules may:
o Generate Code;
o Insert information into the Symbol Table;
o Perform Semantic Check;
o Issue error messages;
o etc.
There are two notations for attaching semantic rules:
1. Syntax Directed Definitions. High-level specification hiding many implementation
details (also called Attribute Grammars).
2. Translation Schemes. More implementation oriented: Indicate the order in which
semantic rules are to be evaluated.
Syntax Directed Definitions
Syntax Directed Definitions are a generalization of context-free grammars in which:
1. Grammar symbols have an associated set of Attributes;
2. Productions are associated with Semantic Rules for computing the values of attributes.
Such formalism generates Annotated Parse-Trees where each node of the tree is a
record with a field for each attribute (e.g.,X.a indicates the attribute a of the grammar
symbol X).
The value of an attribute of a grammar symbol at a given parse-tree node is defined by
a semantic rule associated with the production used at that node.
We distinguish between two kinds of attributes:
1. Synthesized Attributes. They are computed from the values of the attributes of the
children nodes.
2. Inherited Attributes. They are computed from the values of the attributes of both the
siblings and the parent nodes
Syntax Directed Definitions: An Example
Example. Let us consider the grammar for arithmetic expressions. The Syntax Directed Definition associates with each nonterminal a synthesized attribute called val.
S-ATTRIBUTED DEFINITIONS
Definition. An S-Attributed Definition is a Syntax Directed Definition that uses only
synthesized attributes.
Evaluation Order. Semantic rules in a S-Attributed Definition can be evaluated by a
bottom-up, or PostOrder, traversal of the parse-tree.
Example. The above arithmetic grammar is an example of an S-Attributed
Definition. The annotated parse-tree for the input 3*5+4n is:
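The bottom-up evaluation the annotated tree depicts is easy to simulate. The sketch below assumes the usual textbook grammar L -> E n, E -> E + T | T, T -> T * F | F, F -> digit, and computes each node's val from its children in postorder:

```python
def val(node):
    """PostOrder evaluation of the synthesized attribute 'val'."""
    kind = node[0]
    if kind == 'digit':
        return node[1]                    # F.val = digit.lexval
    left, right = node[1], node[2]
    if kind == '+':
        return val(left) + val(right)     # E.val = E1.val + T.val
    if kind == '*':
        return val(left) * val(right)     # T.val = T1.val * F.val

# parse tree for 3*5+4n (the end marker n carries no value)
tree = ('+', ('*', ('digit', 3), ('digit', 5)), ('digit', 4))
print(val(tree))  # 19
```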
L-attributed definition
Definition: An SDD is L-attributed if each inherited attribute of Xi in the RHS of A -> X1 ... Xn depends only on
1. attributes of X1, X2, ..., Xi-1 (the symbols to the left of Xi in the RHS), and
2. inherited attributes of A.
Restrictions for translation schemes:
1. Inherited attribute of Xi must be computed by an action before Xi.
2. An action must not refer to synthesized attribute of any symbol to the right of that action.
3. Synthesized attribute for A can only be computed after all attributes it references have
been completed (usually at end of RHS).
SYMBOL TABLES
A symbol table is a major data structure used in a compiler. It associates attributes with identifiers used in a program; for instance, a type attribute is usually associated with each identifier. A symbol table is a necessary component because the definition (declaration) of an identifier appears once in a program, while uses of the identifier may appear in many places of the program text. Identifiers and attributes are entered by the analysis phases when processing a definition (declaration) of an identifier. In simple languages with only global variables and implicit declarations, the scanner can enter an identifier into the symbol table if it is not already there. In block-structured languages with scopes and explicit declarations, the parser and/or semantic analyzer enter identifiers and corresponding attributes.
Symbol table information is used by the analysis and synthesis phases:
To verify that used identifiers have been defined (declared)
To verify that expressions and assignments are semantically correct (type checking)
To generate intermediate or target code
Symbol Table Interface
The basic operations defined on a symbol table include:
allocate: allocate a new empty symbol table
free: remove all entries and free the storage of a symbol table
insert: insert a name in a symbol table and return a pointer to its entry
lookup: search for a name and return a pointer to its entry
set_attribute: associate an attribute with a given entry
get_attribute: get an attribute associated with a given entry
Other operations can be added depending on requirements. For example, a delete operation removes a name previously inserted; some identifiers become invisible (out of scope) after exiting a block.
This interface provides an abstract view of a symbol table
Supports the simultaneous existence of multiple tables
Implementation can vary without modifying the interface
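A minimal sketch of this interface, backed by a dictionary so insert and lookup are O(1) on average (the class and method names mirror the operations listed above; the attribute encoding is an illustrative choice):

```python
class SymbolTable:
    """Toy symbol table supporting the interface operations listed above."""

    def __init__(self):                       # 'allocate': a new empty table
        self.entries = {}

    def insert(self, name):
        """Insert a name and return its entry (a dict of attributes)."""
        return self.entries.setdefault(name, {})

    def lookup(self, name):
        """Return the entry for name, or None if it was never declared."""
        return self.entries.get(name)

    def set_attribute(self, name, attr, value):
        self.entries[name][attr] = value

    def get_attribute(self, name, attr):
        return self.entries[name][attr]

    def free(self):                           # remove all entries
        self.entries.clear()

table = SymbolTable()
table.insert('x')
table.set_attribute('x', 'type', 'int')
print(table.get_attribute('x', 'type'))   # int
print(table.lookup('y'))                  # None
```

Swapping the dictionary for a sorted array or a binary search tree changes only the implementation, not this interface, which is exactly the point made above.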
Basic Implementation Techniques
The first consideration is how to insert and lookup names. There is a variety of implementation techniques:
Unordered List
Simplest to implement
Implemented as an array or a linked list
A linked list can grow dynamically, which alleviates the problem of a fixed-size array
Insertion is fast O(1), but lookup is slow for large tables: O(n) on average
Ordered List
If an array is sorted, it can be searched using binary search: O(log2 n)
Insertion into a sorted array is expensive: O(n) on average
Useful when the set of names is known in advance, e.g. a table of reserved words
Binary Search Tree
Can grow dynamically
Insertion and lookup are O(log2 n) on average
RUNTIME ENVIRONMENT
Runtime organization covers the different storage locations and the representation of scopes and extents during program execution. Components of an executing program reside in blocks of memory supplied by the OS. Three kinds of entities need to be managed at runtime:
o Generated code for various procedures and programs: forms the text or code segment of the program; size known at compile time.
o Data objects:
Global variables/constants: size known at compile time
Variables declared within procedures/blocks: size known
Variables created dynamically: size unknown
o Stack to keep track of procedure activations.
Memory is subdivided conceptually into code and data areas:
Code: program instructions
Stack: manages activation of procedures at runtime
Heap: holds variables created dynamically
STORAGE ORGANIZATION
1. Fixed-size objects can be placed in predefined locations.
2. Run-time stack and heap. The STACK is used to store:
o Procedure activations.
o The status of the machine just before calling a procedure, so that the status can be restored when the called procedure returns.
The HEAP stores data allocated under program control (e.g. by malloc() in C).
Activation records
Any information needed for a single activation of a procedure is stored in
the ACTIVATION RECORD (sometimes called the STACK FRAME). Today, we’ll
assume the stack grows DOWNWARD, as on, e.g., the Intel architecture. The
activation record gets pushed for each procedure call and popped for each procedure
return.
STATIC ALLOCATION
Statically allocated names are bound to storage at compile time. Storage
bindings of statically allocated names never change, so even if a name is local to a
procedure, its name is always bound to the same storage. The compiler uses the type
of a name (retrieved from the symbol table) to determine storage size required. The
required number of bytes (possibly aligned) is set aside for the name.The address of
the storage is fixed at compile time.
Limitations:
The size required must be known at compile time.
Recursive procedures cannot be implemented as all locals are
statically allocated.
No data structure can be created dynamically as all data’s static.
Stack-dynamic allocation
Storage is organized as a stack.
Activation records are pushed and popped.
Locals and parameters are contained in the activation records for the call.
This means locals are bound to fresh storage on every call.
If we have a stack growing downwards, we just need a stack_top pointer.
To allocate a new activation record, we just move stack_top down (decrease it).
To deallocate an existing activation record, we just move stack_top back up (increase it).
Address generation in stack allocation
The position of the activation record on the stack cannot be determined statically.
Therefore the compiler must generate addresses RELATIVE to the activation record.
If we have a downward-growing stack and a stack_top pointer, we generate addresses
of the form stack_top + offset
float f(int k)
{
float c[10],b;
b = c[k]*3.14;
return b;
}
The activation record for f (offsets are relative to the base of the record):
offset = 0: return value
offset = 4: parameter k
offset = 8: local c[10]
offset = 48: local b
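Putting these offsets together with a stack_top pointer gives the stack_top + offset addressing described above. The starting address below is made up purely for illustration; on a downward-growing stack, pushing a frame lowers stack_top:

```python
stack_top = 1000          # hypothetical address of the current stack top

def push_frame(size):
    """Allocate an activation record of `size` bytes on a downward-growing stack."""
    global stack_top
    stack_top -= size
    return stack_top      # base address of the new record

# f's record from the example: return value (4) + k (4) + c[10] (40) + b (4) = 52 bytes
base = push_frame(52)
addr = {'retval': base + 0, 'k': base + 4, 'c': base + 8, 'b': base + 48}
print(addr['b'] - addr['retval'])  # 48, the compile-time offset of b
```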
HEAP ALLOCATION
Some languages do not have tree-structured allocations. In these cases, activations have to be allocated on the heap. This allows strange situations, like callee activations that live longer than their callers' activations; this is not common. The heap is used for allocating space for objects created at run time, for example nodes of dynamic data structures such as linked lists and trees.
Dynamic memory allocation and deallocation is based on the requirements of the program:
malloc() and free() in C programs
new() and delete() in C++ programs
new() and garbage collection in Java programs
Allocation and deallocation may be completely manual (C/C++), semi-automatic (Java), or fully automatic (Lisp).
PARAMETERS PASSING
A language has first-class functions if functions can be declared within any scope, passed as arguments to other functions, and returned as results of functions. In a language with first-class functions and static scope, a function value is generally represented by a closure: a pair consisting of a pointer to the function code and a pointer to an activation record. Passing functions as arguments is very useful in structuring of systems using upcalls.
An example:
main()
{
    int x = 4;
    int f (int y) {
        return x*y;
    }
    int g (int → int h) {
        int x = 7;
        return h(3) + x;
    }
}
Call-by-Value
The actual parameters are evaluated and their r-values are passed to the called procedure. A procedure called by value can affect its caller either through nonlocal names or through pointers.
Parameters in C are always passed by value. Arrays are unusual: what is passed by value is a pointer.
Pascal uses pass-by-value by default, but var parameters are passed by reference.
Call-by-Reference
Also known as call-by-address or call-by-location. The caller passes to the called
procedure the l-value of the parameter.
If the parameter is an expression, then the expression is evaluated in a new location,
and the address of the new location is passed.
Parameters in Fortran are passed by reference; an old implementation bug in Fortran:
func(a,b) { a = b};
call func(3,4); print(3);
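The bug above can be simulated directly: if literals live in a constant pool and are passed by reference, assigning through the reference corrupts the literal. Everything below is a simulation of that historical behavior, not what any modern compiler does:

```python
memory = {'lit_3': 3, 'lit_4': 4}   # hypothetical constant pool: one cell per literal

def func(a_ref, b_ref):             # call-by-reference: parameters are cell names
    memory[a_ref] = memory[b_ref]   # a = b writes through the reference

func('lit_3', 'lit_4')              # call func(3,4) passes the literals' cells
print(memory['lit_3'])              # 4 -- a later print(3) would fetch this cell
```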
Copy-Restore
A hybrid between call-by-value and call-by-reference.
The actual parameters are evaluated and their r-values are passed as in call-by-value. In addition, l-values are determined before the call.
When control returns, the current r-values of the formal parameters are copied back
into the l-values of the actual parameters.
Call-by-Name
The actual parameters literally substituted for the formals. This is like a macro-
expansion or in-line expansion Call-by-name is not used in practice. However, the
conceptually related technique of in-line expansion is commonly used. In-lining may
be one of the most effective optimization transformations if they are guided by
execution profiles.
UNIT V - CODE OPTIMIZATION AND CODE GENERATION
INTRODUCTION
The code produced by straightforward compiling algorithms can often be made to run faster or take less space, or both. This improvement is achieved by program transformations
that are traditionally called optimizations. Compilers that apply code-improving
transformations are called optimizing compilers.
Optimizations are classified into two categories:
Machine independent optimizations
Machine dependent optimizations
Machine independent optimizations:
Machine independent optimizations are program transformations that improve the target code
without taking into consideration any properties of the target machine.
Machine dependent optimizations:
Machine dependent optimizations are based on register allocation and utilization of special machine-instruction sequences.
The criteria for code improvement transformations:
Simply stated, the best program transformations are those that yield the most benefit for the
least effort.
The transformation must preserve the meaning of programs. That is, the optimization must
not change the output produced by a program for a given input, or cause an error such as
division by zero, that was not present in the original source program. At all times we take the
safe approach of missing an opportunity to apply a transformation rather than risk
changing what the program does.
A transformation must, on the average, speed up programs by a measurable amount. We are also interested in reducing the size of the compiled code, although size matters less than it once did. Not every transformation succeeds in improving every program; occasionally an "optimization" may slow down a program slightly.
The transformation must be worth the effort. It does not make sense for a compiler writer to
expend the intellectual effort to implement a code improving transformation and to have the
compiler expend the additional time compiling source programs if this effort is not repaid
when the target programs are executed. Peephole transformations of this kind are simple
enough and beneficial enough to be included in any compiler.
Organization for an Optimizing Compiler:
Flow analysis is a fundamental prerequisite for many important types of code
improvement.
Generally control flow analysis precedes data flow analysis.
Control flow analysis (CFA) represents flow of control usually in form of graphs, CFA
constructs such as
control flow graph
Call graph
Data flow analysis (DFA) is the process of ascertaining and collecting information prior to program execution about the possible modification, preservation, and use of certain entities (such as values or attributes of variables) in a computer program.
PRINCIPAL SOURCES OF OPTIMISATION
A transformation of a program is called local if it can be performed by looking only at the
statements in a basic block; otherwise, it is called global.
Many transformations can be performed at both the local and global levels. Local
transformations are usually performed first.
Function-Preserving Transformations
There are a number of ways in which a compiler can improve a program without
changing the function it computes.
The transformations
Common sub expression elimination,
Copy propagation,
Dead-code elimination, and
Constant folding
are common examples of such function-preserving transformations. The other
transformations come up primarily when global optimizations are performed.
Frequently, a program will include several calculations of the same value, such as an
offset in an array. Some of the duplicate calculations cannot be avoided by the
programmer because they lie below the level of detail accessible within the source
language.
Common Sub expressions elimination:
An occurrence of an expression E is called a common sub-expression if E was previously
computed, and the values of variables in E have not changed since the previous
computation. We can avoid recomputing the expression if we can use the previously
computed value.
For example:
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t4 := 4*i
t5 := n
t6 := b[t4] + t5
The above code can be optimized using common sub-expression elimination as:
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t5 := n
t6 := b[t1] + t5
The common sub-expression t4 := 4*i is eliminated because its value is already computed in t1, and the value of i has not changed between that definition and this use.
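A minimal local CSE pass over three-address tuples reproduces this transformation. The statement encoding below, including the compound 'index' and 'index+' operators, is an ad-hoc choice for illustration:

```python
def local_cse(block):
    """block: list of (dest, op, args) three-address statements (ad-hoc encoding)."""
    avail = {}      # (op, args) -> variable that already holds that value
    replace = {}    # eliminated temporary -> equivalent variable
    out = []
    for dest, op, args in block:
        args = tuple(replace.get(a, a) for a in args)   # propagate eliminated temps
        key = (op, args)
        if key in avail:
            replace[dest] = avail[key]                  # drop the recomputation
            continue
        # a redefinition of dest kills expressions that use dest or live in dest
        avail = {k: v for k, v in avail.items() if v != dest and dest not in k[1]}
        avail[key] = dest
        out.append((dest, op, args))
    return out

block = [
    ('t1', '*',      ('4', 'i')),
    ('t2', 'index',  ('a', 't1')),         # t2 := a[t1]
    ('t3', '*',      ('4', 'j')),
    ('t4', '*',      ('4', 'i')),          # same expression as t1
    ('t5', 'copy',   ('n',)),              # t5 := n
    ('t6', 'index+', ('b', 't4', 't5')),   # t6 := b[t4] + t5
]
optimized = local_cse(block)
for stmt in optimized:
    print(stmt)       # t4 is gone; t6 now reads b[t1] + t5
```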
Copy Propagation:
Assignments of the form f : = g called copy statements, or copies for short. The idea
behind the copy-propagation transformation is to use g for f, whenever possible after the
copy statement f: = g. Copy propagation means use of one variable instead of another.
This may not appear to be an improvement, but as we shall see it gives us an opportunity
to eliminate x.
For example:
x = Pi;
……
A = x*r*r;
The optimization using copy propagation can be done as follows:
A = Pi*r*r;
Here the variable x is eliminated.
Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently; otherwise,
it is dead at that point. A related idea is dead or useless code, statements that compute
values that never get used. While the programmer is unlikely to introduce any dead code
intentionally, it may appear as the result of previous transformations. An optimization can
be done by eliminating dead code.
Example:
i=0;
if(i==1)
{
a=b+5;
}
Here, the 'if' statement is dead code because this condition will never be satisfied.
Constant folding:
We can eliminate both the test and printing from the object code. More generally,
deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
One advantage of copy propagation is that it often turns the copy statement into dead
code.
For example,
a=3.14157/2 can be replaced by
a=1.570, thereby eliminating a division operation.
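A compile-time folder for binary expressions is only a few lines. The tuple encoding of unfolded expressions is an illustrative choice:

```python
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def fold(op, a, b):
    """Fold a binary expression at compile time when both operands are constants."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return OPS[op](a, b)
    return (op, a, b)            # not foldable: leave the expression alone

print(fold('/', 3.14157, 2))     # 1.570785 -- the division disappears from the object code
print(fold('*', 'r', 'r'))       # ('*', 'r', 'r') stays, since r is a variable
```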
Loop Optimizations:
We now give a brief introduction to a very important place for optimizations, namely
loops, especially the inner loops where programs tend to spend the bulk of their time. The
running time of a program may be improved if we decrease the number of instructions in
an inner loop, even if we increase the amount of code outside that loop.
Three techniques are important for loop optimization:
Code motion, which moves code outside a loop;
Induction-variable elimination, which we apply to eliminate induction variables from inner loops;
Reduction in strength, which replaces an expensive operation by a cheaper one, such as a multiplication by an addition.
Code Motion:
An important modification that decreases the amount of code in a loop is code motion. This transformation takes an expression that yields the same result independent of the number of times a loop is executed (a loop-invariant computation) and places the expression before the loop. Note that the notion "before the loop" assumes the existence of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant computation in the following while-statement:
while (i <= limit-2) /* statement does not change limit */
Code motion will result in the equivalent of
t= limit-2;
while (i<=t) /* statement does not change limit or t */
Induction Variables:
Loops are usually processed inside out. For example, consider the loop around B3.
Note that the values of j and t4 remain in lock-step; every time the value of j decreases by 1, that of t4 decreases by 4, because 4*j is assigned to t4. Such identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible to get rid of all but one, by the process of induction-variable elimination. For the inner loop around B3 in Fig. we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4. However, we can illustrate reduction in strength and illustrate a part of the process of induction-variable elimination. Eventually j will be eliminated when the outer loop of B2 - B5 is considered.
Example:
As the relationship t4 := 4*j surely holds after such an assignment to t4 in Fig., and t4 is not changed elsewhere in the inner loop around B3, it follows that just after the statement j := j-1 the relationship t4 = 4*j - 4 must hold. We may therefore replace the assignment t4 := 4*j by t4 := t4 - 4. The only problem is that t4 does not have a value when we enter block B3 for the first time. Since we must maintain the relationship t4 = 4*j on entry to block B3, we place an initialization of t4 at the end of the block where j itself is initialized, shown by the dashed (before/after) addition to block B1 in the second figure.
The replacement of a multiplication by a subtraction will speed up the object code if
multiplication takes more time than addition or subtraction, as is the case on many
machines.
Reduction In Strength:
Reduction in strength replaces expensive operations by equivalent cheaper ones on the
target machine. Certain machine instructions are considerably cheaper than others and
can often be used as special cases of more expensive operators.
For example, x² is invariably cheaper to implement as x*x than as a call to an
exponentiation routine. Fixed-point multiplication or division by a power of two is
cheaper to implement as a shift. Floating-point division by a constant can be implemented
as multiplication by a constant, which may be cheaper.
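The effect of strength-reducing an induction variable can be sketched in Python. The loop shape and the names j and t4 follow the discussion above; the function names are illustrative only.

```python
def multiply_version(n):
    """Original loop: recomputes 4*j with a multiplication each iteration."""
    trace = []
    j = n
    while j > 0:
        t4 = 4 * j          # expensive multiply inside the loop
        trace.append(t4)
        j = j - 1
    return trace

def reduced_version(n):
    """After reduction in strength: t4 := t4 - 4 replaces t4 := 4*j."""
    trace = []
    j = n
    t4 = 4 * j              # initialization hoisted to the loop pre-header
    while j > 0:
        trace.append(t4)
        j = j - 1
        t4 = t4 - 4         # cheap subtraction keeps t4 = 4*j in lock-step
    return trace
```

Both versions produce the same sequence of t4 values; only the cost per iteration differs.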
OPTIMIZATION OF BASIC BLOCKS
There are two types of basic block optimizations. They are :
Structure-Preserving Transformations
Algebraic Transformations
Structure-Preserving Transformations:
The primary Structure-Preserving Transformation on basic blocks are:
Common sub-expression elimination
Dead code elimination
Renaming of temporary variables
Interchange of two independent adjacent statements.
Common sub-expression elimination:
Common sub-expressions need not be computed over and over again. Instead, they can be computed once, kept in store, and reused when encountered again, provided the values of the variables in the expression remain unchanged.
Example:
a := b+c
b := a-d
c := b+c
d := a-d
The 2nd and 4th statements compute the same expression a-d, since neither a nor d is redefined between them. The expression b+c in the 3rd statement is not a common sub-expression, because b is redefined by the 2nd statement.
The basic block can be transformed to
a := b+c
b := a-d
c := b+c
d := b
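A local common sub-expression elimination pass over a block of three-address tuples can be sketched as follows. The (dest, op, arg1, arg2) tuple layout and the 'copy' pseudo-op are assumptions for illustration; the pass must discard an available expression whenever one of its operands, or the variable holding its value, is redefined.

```python
def local_cse(block):
    """Replace recomputed expressions in a basic block with copies."""
    available = {}                        # (op, a1, a2) -> var holding the value
    out = []
    for dest, op, a1, a2 in block:
        key = (op, a1, a2)
        if key in available:
            out.append((dest, 'copy', available[key], None))
        else:
            out.append((dest, op, a1, a2))
        # redefining dest kills expressions that use dest or live in dest
        available = {k: v for k, v in available.items()
                     if v != dest and dest not in (k[1], k[2])}
        if dest not in (a1, a2):          # result still names the value
            available[key] = dest
    return out
```

On the example block above, the pass rewrites only the fourth statement (d gets a copy of b); the third stays a genuine computation because b was redefined.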
Dead code elimination:
It's possible that a large amount of dead (useless) code may exist in the program. This is especially common when variables and procedures are introduced during the construction or error-correction of a program: once declared and defined, one forgets to remove them even when they serve no purpose. Eliminating these will definitely optimize the code.
Renaming of temporary variables:
A statement t:=b+c where t is a temporary name can be changed to u:=b+c where u is
another temporary name, and change all uses of t to u.
In this way we can transform a basic block into its equivalent block, called a normal-form block.
Interchange of two independent adjacent statements:
Two statements
t1 := b+c
t2 := x+y
can be interchanged or reordered in the basic block when the value of t1 does not affect the value of t2.
Algebraic Transformations:
Algebraic identities represent another important class of optimizations on basic blocks.
This includes simplifying expressions or replacing expensive operation by cheaper ones
i.e. reduction in strength.
Another class of related optimizations is constant folding. Here we evaluate constant
expressions at compile time and replace the constant expressions by their values. Thus
the expression 2*3.14 would be replaced by 6.28.
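Constant folding can be sketched as a pass over three-address tuples; the tuple layout and the 'const' pseudo-op are illustrative assumptions, not from the text.

```python
def fold_constants(stmts):
    """Evaluate operations whose operands are both compile-time constants."""
    ops = {'+': lambda x, y: x + y,
           '-': lambda x, y: x - y,
           '*': lambda x, y: x * y}
    out = []
    for dest, op, a1, a2 in stmts:
        if isinstance(a1, (int, float)) and isinstance(a2, (int, float)):
            out.append((dest, 'const', ops[op](a1, a2), None))  # folded at compile time
        else:
            out.append((dest, op, a1, a2))
    return out
```

The expression 2*3.14 from the text folds to 6.28, while operations on variables pass through unchanged.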
The relational operators <=, >=, <, > and = sometimes generate unexpected common sub-expressions.
Associative laws may also be applied to expose common sub expressions. For example, if
the source code has the assignments
a :=b+c
e :=c+d+b
the following intermediate code may be generated:
a :=b+c
t :=c+d
e :=t+b
Example:
x:=x+0 can be removed
x:=y**2 can be replaced by a cheaper statement x:=y*y
The compiler writer should examine the language carefully to determine what
rearrangements of computations are permitted, since computer arithmetic does not always
obey the algebraic identities of mathematics. Thus, a compiler may evaluate x*y-x*z as
x*(y-z) but it may not evaluate a+(b-c) as (a+b)-c.
LOOPS IN FLOW GRAPH
A graph representation of three-address statements, called a flow graph, is useful for
understanding code-generation algorithms, even if the graph is not explicitly constructed by a
code-generation algorithm. Nodes in the flow graph represent computations, and the edges
represent the flow of control.
Dominators:
In a flow graph, a node d dominates node n if every path from the initial node of the flow graph to n goes through d. This will be denoted by d dom n. The initial node dominates all the remaining nodes in the flow graph, and the entry of a loop dominates all nodes in the loop. Similarly, every node dominates itself.
Example:
*In the flow graph below,
*Initial node, node 1, dominates every node.
*node 2 dominates itself.
*node 3 dominates all but 1 and 2.
*node 4 dominates all but 1, 2 and 3.
*nodes 5 and 6 dominate only themselves, since flow of control can skip around either by going through the other.
*node 7 dominates 7, 8, 9 and 10.
*node 8 dominates 8, 9 and 10.
*nodes 9 and 10 dominate only themselves.
The way of presenting dominator information is in a tree, called the dominator tree, in which the initial node is the root.
The parent of each node other than the root is its immediate dominator.
Each node d dominates only its descendants in the tree.
The existence of the dominator tree follows from a property of dominators: each node n has a unique immediate dominator m that is the last dominator of n on any path from the initial node to n.
In terms of the dom relation, the immediate dominator m has the property: if d != n and d dom n, then d dom m.
D(1)={1}
D(2)={1,2}
D(3)={1,3}
D(4)={1,3,4}
D(5)={1,3,4,5}
D(6)={1,3,4,6}
D(7)={1,3,4,7}
D(8)={1,3,4,7,8}
D(9)={1,3,4,7,8,9}
D(10)={1,3,4,7,8,10}
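These sets can be reproduced with the standard iterative dominator computation: D(n) starts as the set of all nodes and shrinks to a fixed point where D(n) = {n} plus the intersection of D(p) over all predecessors p. The edge list used below is reconstructed from the dominator sets and the back edges discussed later, so treat it as an assumption about the figure.

```python
def dominators(nodes, edges, start):
    """Iteratively compute the dominator set of every node in a flow graph."""
    preds = {n: set() for n in nodes}
    for a, b in edges:
        preds[b].add(a)
    dom = {n: set(nodes) for n in nodes}   # start from the full node set
    dom[start] = {start}
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for n in nodes:
            if n == start:
                continue
            incoming = (set.intersection(*(dom[p] for p in preds[n]))
                        if preds[n] else set())
            new = {n} | incoming
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```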
Natural Loop:
One application of dominator information is in determining the loops of a flow graph suitable
for improvement.
The properties of loops are
A loop must have a single entry point, called the header. This entry point dominates all nodes in the loop, or it would not be the sole entry to the loop.
There must be at least one way to iterate the loop(i.e.)at least one path back to the header.
One way to find all the loops in a flow graph is to search for edges whose heads dominate their tails. If a→b is an edge, b is the head and a is the tail. Such edges are called back edges.
Example:
In the above graph,
7→4 (4 DOM 7)
10→7 (7 DOM 10)
4→3 (3 DOM 4)
8→3 (3 DOM 8)
9→1 (1 DOM 9)
The above edges will form loops in the flow graph.
Given a back edge n→d, we define the natural loop of the edge to be d plus the set of nodes that can reach n without going through d. Node d is the header of the loop.
Algorithm: Constructing the natural loop of a back edge.
Input: A flow graph G and a back edge n→d.
Output: The set loop consisting of all nodes in the natural loop n→d.
Method: Beginning with node n, we consider each node m != d that we know is in loop, to make sure that m's predecessors are also placed in loop. Each node in loop, except for d, is placed once on stack, so its predecessors will be examined. Note that because d is put in the loop initially, we never examine its predecessors, and thus find only those nodes that reach n without going through d.
procedure insert(m);
if m is not in loop then begin
loop := loop U {m};
push m onto stack
end;

/* main program */
stack := empty;
loop := {d};
insert(n);
while stack is not empty do begin
pop m, the first element of stack, off stack;
for each predecessor p of m do insert(p)
end
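The algorithm above can be transcribed directly into Python. Representing the graph as a dict mapping each node to its set of predecessors is an assumed encoding; the predecessor map in the usage below is reconstructed from the example flow graph.

```python
def natural_loop(preds, n, d):
    """Collect the natural loop of back edge n -> d: d plus every node
    that can reach n without passing through d."""
    loop = {d}                  # d is in the loop from the start, so its
    stack = []                  # predecessors are never examined
    def insert(m):
        if m not in loop:
            loop.add(m)
            stack.append(m)
    insert(n)
    while stack:
        m = stack.pop()
        for p in preds.get(m, ()):
            insert(p)           # predecessors of loop nodes join the loop
    return loop
```

For the back edge 7→4 of the example graph this yields the loop {4, 5, 6, 7, 8, 10}, and for 10→7 the inner loop {7, 8, 10}.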
Inner loop:
If we use the natural loops as "the loops", then we have the useful property that unless two loops have the same header, they are either disjoint or one is entirely contained in the other. Thus, neglecting loops with the same header for the moment, we have a natural notion of inner loop: one that contains no other loop.
When two natural loops have the same header, but neither is nested within the other, they
are combined and treated as a single loop.
Pre-Headers:
Several transformations require us to move statements before the header. Therefore we begin the treatment of a loop L by creating a new block, called the pre-header.
The pre-header has only the header as successor, and all edges which formerly entered
the header of L from outside L instead enter the pre-header.
Edges from inside loop L to the header are not changed.
Initially the pre-header is empty, but transformations on L may place statements in it.
[Figure: (a) before: edges from outside L enter the header of loop L directly; (b) after: those edges enter the new pre-header, whose only successor is the header]
Reducible flow graphs:
Reducible flow graphs are special flow graphs for which several code-optimization transformations are especially easy to perform: loops are unambiguously defined, dominators can be easily calculated, and data-flow analysis problems can be solved efficiently.
Exclusive use of structured flow-of-control statements such as if-then-else, while-do,
continue, and break statements produces programs whose flow graphs are always
reducible.
The most important properties of reducible flow graphs are that there are no jumps into
the middle of loops from outside; the only entry to a loop is through its header.
Definition:
A flow graph G is reducible if and only if we can partition the edges into two disjoint
groups, forward edges and back edges, with the following properties.
The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
The back edges consist only of edges whose heads dominate their tails.
Example: The above flow graph is reducible.
If we know the relation DOM for a flow graph, we can find and remove all the back edges.
The remaining edges are forward edges.
If the forward edges form an acyclic graph, then we can say the flow graph is reducible.
In the above example, removing the five back edges 4→3, 7→4, 8→3, 9→1 and 10→7, whose heads dominate their tails, leaves an acyclic graph.
The key property of reducible flow graphs for loop analysis is that in such flow graphs
every set of nodes that we would informally regard as a loop must contain a back edge.
PEEPHOLE OPTIMIZATION
A statement-by-statement code-generation strategy often produces target code that contains redundant instructions and suboptimal constructs. The quality of such target code can be improved by applying optimizing transformations to the target program.
A simple but effective technique for improving the target code is peephole optimization, a method for trying to improve the performance of the target program by examining a short sequence of target instructions (called the peephole) and replacing these instructions by a shorter or faster sequence, whenever possible.
The peephole is a small, moving window on the target program. The code in the peephole need not be contiguous, although some implementations do require this. It is characteristic of peephole optimization that each improvement may spawn opportunities for additional improvements.
We shall give the following examples of program transformations that are characteristic
of peephole optimizations:
Redundant-instructions elimination
Flow-of-control optimizations
Algebraic simplifications
Use of machine idioms
Unreachable Code
Redundant Loads And Stores:
If we see the instruction sequence
(1) MOV R0, a
(2) MOV a, R0
we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the value of a is already in register R0. If (2) had a label, we could not be sure that (1) was always executed immediately before (2), and so we could not remove (2).
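This check can be sketched as a one-instruction-lookback peephole pass. The (label, op, src, dst) tuple encoding is an assumption for illustration; note how a label on the second instruction blocks the deletion, exactly as described above.

```python
def peephole_redundant_mov(code):
    """Delete an unlabeled MOV a,R that immediately follows MOV R,a.
    code: list of (label, op, src, dst); label is None if absent."""
    out = []
    for label, op, src, dst in code:
        prev = out[-1] if out else None
        if (label is None and op == 'MOV' and prev is not None
                and prev[1] == 'MOV' and prev[2] == dst and prev[3] == src):
            continue            # value is already where it needs to be
        out.append((label, op, src, dst))
    return out
```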
Unreachable Code:
Another opportunity for peephole optimization is the removal of unreachable instructions. An unlabeled instruction immediately following an unconditional jump may be removed. This operation can be repeated to eliminate a sequence of instructions. For example, for debugging purposes, a large program may have within it certain segments that are executed only if a variable debug is 1. In C, the source code might look like:
#define debug 0
….
If ( debug ) {
Print debugging information
}
In the intermediate representation the if-statement may be translated as:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2: …………………………(a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what the value of debug, (a) can be replaced by:
if debug != 1 goto L2
print debugging information
L2: ……………………………(b)
As debug is set to 0 at the beginning of the program, constant propagation transforms (b) into:
if 0 != 1 goto L2
print debugging information
L2: ……………………………(c)
As the argument of the first statement of (c) evaluates to a constant true, it can be replaced by goto L2. Then all the statements that print debugging aids are manifestly unreachable and can be eliminated one at a time.
Flow-Of-Control Optimizations:
The unnecessary jumps can be eliminated in either the intermediate code or the target code by the following types of peephole optimizations. We can replace the jump sequence
goto L1
….
L1: goto L2
by the sequence
goto L2
….
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement L1:goto
L2 provided it is preceded by an unconditional jump .Similarly, the sequence
if a < b goto L1
….
L1: goto L2
can be replaced by
If a < b goto L2
….
L1: goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the
sequence
goto L1
……..
L1: if a < b goto L2
L3: …………………………………..(1)
may be replaced by
If a < b goto L2
goto L3
…….
L3: ………………………………….(2)
While the number of instructions in (1) and (2) is the same, we sometimes skip the unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution time.
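The jumps-to-jumps replacement can be sketched as a retargeting pass: every jump whose target label carries an unconditional goto is redirected to the final destination. The (label, op, target) encoding and op names are assumptions for illustration; the cycle guard keeps a loop of gotos from recursing forever.

```python
def thread_jumps(code):
    """Retarget jumps through chains of unconditional gotos.
    code: list of (label, op, target); label/target are None if absent."""
    trampoline = {label: target for label, op, target in code
                  if op == 'goto' and label is not None}
    def resolve(t, seen=()):
        if t in trampoline and t not in seen:        # follow the chain,
            return resolve(trampoline[t], seen + (t,))  # guarding cycles
        return t
    return [(label, op, resolve(target)) if op in ('goto', 'if-goto')
            else (label, op, target)
            for label, op, target in code]
```

After this pass, a labeled goto that no longer has any jumps to it can be removed by the unreachable-code rule above.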
Algebraic Simplification:
There is no end to the amount of algebraic simplification that can be attempted through peephole optimization. Only a few algebraic identities occur frequently enough that it is worth considering implementing them. For example, statements such as
x := x+0
or
x := x * 1
are often produced by straightforward intermediate code-generation algorithms, and they can be eliminated easily through peephole optimization.
Reduction in Strength:
Reduction in strength replaces expensive operations by equivalent cheaper ones on the target machine. Certain machine instructions are considerably cheaper than others and can often be used as special cases of more expensive operators.
For example, x² is invariably cheaper to implement as x*x than as a call to an exponentiation routine. Fixed-point multiplication or division by a power of two is cheaper to implement as a shift. Floating-point division by a constant can be implemented as multiplication by a constant, which may be cheaper.
x²  →  x*x
Use of Machine Idioms:
The target machine may have hardware instructions to implement certain specific operations
efficiently. For example, some machines have auto-increment and auto-decrement addressing
modes. These add or subtract one from an operand before or after using its value.
The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like i :
=i+1.
i := i+1  →  i++
i := i-1  →  i--
INTRODUCTION TO GLOBAL DATAFLOW ANALYSIS
In order to do code optimization and a good job of code generation, the compiler needs to collect information about the program as a whole and to distribute this information to each block in the flow graph.
A compiler could take advantage of "reaching definitions", such as knowing where a variable like debug was last defined before reaching a given block, in order to perform transformations. These are just a few examples of data-flow information that an optimizing compiler collects by a process known as data-flow analysis.
Data-flow information can be collected by setting up and solving systems of equations of the form:
out[S] = gen[S] U (in[S] - kill[S])
This equation can be read as "the information at the end of a statement is either generated within the statement, or enters at the beginning and is not killed as control flows through the statement."
The details of how data-flow equations are set and solved depend on three factors.
The notions of generating and killing depend on the desired information, i.e., on the data
flow analysis problem to be solved. Moreover, for some problems, instead of proceeding
along with flow of control and defining out[s] in terms of in[s], we need to proceed
backwards and define in[s] in terms of out[s].
Since data flows along control paths, data-flow analysis is affected by the constructs in a program. In fact, when we write out[S] we implicitly assume that there is a unique end point where control leaves the statement; in general, equations are set up at the level of basic blocks rather than statements, because blocks do have unique end points.
There are subtleties that go along with such statements as procedure calls, assignments
through pointer variables, and even assignments to array variables.
Points and Paths:
Within a basic block, we talk of the point between two adjacent statements, as well as the
point before the first statement and after the last. Thus, block B1 has four points: one
before any of the assignments and one after each of the three assignments.
[Figure: flow graph with blocks B1-B6. B1 contains d1: i := m-1, d2: j := n and d3: a := u1; B2 contains d4: i := i+1; B3 contains d5: j := j-1; and d6: a := u2 appears in a later block.]
Now let us take a global view and consider all the points in all the blocks. A path from p1 to pn is a sequence of points p1, p2, …, pn such that for each i between 1 and n-1, either
pi is the point immediately preceding a statement and pi+1 is the point immediately following that statement in the same block, or
pi is the end of some block and pi+1 is the beginning of a successor block.
Reaching definitions:
A definition of variable x is a statement that assigns, or may assign, a value to x. The
most common forms of definition are assignments to x and statements that read a value
from an i/o device and store it in x.
These statements certainly define a value for x, and they are referred to as unambiguous
definitions of x. There are certain kinds of statements that may define a value for x; they
are called ambiguous definitions. The most usual forms of ambiguous definitions of x
are:
A call of a procedure with x as a parameter or a procedure that can access x because x is
in the scope of the procedure.
An assignment through a pointer that could refer to x. For example, the assignment *q := y is a definition of x if it is possible that q points to x. Unless we know otherwise, we must assume that an assignment through a pointer is a definition of every variable.
We say a definition d reaches a point p if there is a path from the point immediately following d to p, such that d is not "killed" along that path. Thus a point can be reached by an unambiguous definition and an ambiguous definition of the same variable appearing later along one path.
Data-flow analysis of structured programs:
Flow graphs for control flow constructs such as do-while statements have a useful property: there is a single beginning point at which control enters and a single end point that control leaves from when execution of the statement is over. We exploit this property when we talk of the definitions reaching the beginning and the end of statements with the following syntax.
S → id := E | S ; S | if E then S else S | do S while E
E → id + id | id
Expressions in this language are similar to those in the intermediate code, but the flow
graphs for statements have restricted forms.
[Figure: restricted flow-graph forms for the statements S1 ; S2, if E then S1 else S2, and do S1 while E]
We define a portion of a flow graph called a region to be a set of nodes N that includes a
header, which dominates all other nodes in the region. All edges between nodes in N are
in the region, except for some that enter the header.
The portion of flow graph corresponding to a statement S is a region that obeys the
further restriction that control can flow to just one outside block when it leaves the
region.
We say that the beginning points of the dummy blocks at the entry and exit of a
statement‟s region are the beginning and end points, respectively, of the statement. The
equations are inductive, or syntax-directed, definition of the sets in[S], out[S], gen[S],
and kill[S] for all statements S.
gen[S] is the set of definitions "generated" by S, while kill[S] is the set of definitions that never reach the end of S.
Consider the following data-flow equations for reaching definitions :
i) S → d: a := b + c

gen[S] = {d}
kill[S] = Da - {d}
out[S] = gen[S] U (in[S] - kill[S])

Observe the rules for a single assignment to variable a. Surely that assignment is a definition of a, say d. Thus
gen[S] = {d}
On the other hand, d "kills" all other definitions of a, so we write
kill[S] = Da - {d}
where Da is the set of all definitions in the program for variable a.

ii) S → S1 ; S2

gen[S] = gen[S2] U (gen[S1] - kill[S2])
kill[S] = kill[S2] U (kill[S1] - gen[S2])
in[S1] = in[S]
in[S2] = out[S1]
out[S] = out[S2]
Under what circumstances is definition d generated by S = S1 ; S2? First of all, if it is generated by S2, then it is surely generated by S. If d is generated by S1, it will reach the end of S provided it is not killed by S2. Thus, we write
gen[S] = gen[S2] U (gen[S1] - kill[S2])
Similar reasoning applies to the killing of a definition, so we have
kill[S] = kill[S2] U (kill[S1] - gen[S2])
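These two composition rules map directly onto Python set operations; a minimal sketch, with the definition names in the usage chosen arbitrarily:

```python
def compose(gen1, kill1, gen2, kill2):
    """gen and kill for S = S1 ; S2 from the sets of S1 and S2."""
    gen = gen2 | (gen1 - kill2)      # gen[S2] U (gen[S1] - kill[S2])
    kill = kill2 | (kill1 - gen2)    # kill[S2] U (kill[S1] - gen[S2])
    return gen, kill
```

For example, if S1 generates d1 and kills d2 and d3, while S2 generates d2 and kills d1, then the sequence generates only d2 (d1 is killed by S2) and kills d1 and d3.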
Conservative estimation of data-flow information:
There is a subtle miscalculation in the rules for gen and kill. We have made the assumption that the conditional expressions E in the if and do statements are uninterpreted; that is, there exist inputs to the program that make their branches go either way.
We assume that any graph-theoretic path in the flow graph is also an execution path, i.e., a path that is executed when the program is run with at least one possible input.
When we compare the computed gen with the true gen we discover that the true gen is always a subset of the computed gen. On the other hand, the true kill is always a superset of the computed kill.
These containments hold even after we consider the other rules. It is natural to wonder
whether these differences between the true and computed gen and kill sets present a
serious obstacle to data-flow analysis. The answer lies in the use intended for these data.
Overestimating the set of definitions reaching a point does not seem serious; it merely
stops us from doing an optimization that we could legitimately do. On the other hand,
underestimating the set of definitions is a fatal error; it could lead us into making a
change in the program that changes what the program computes. For the case of reaching
definitions, then, we call a set of definitions safe or conservative if the estimate is a
superset of the true set of reaching definitions. We call the estimate unsafe, if it is not
necessarily a superset of the truth.
Returning now to the implications of safety on the estimation of gen and kill for reaching definitions, note that our discrepancies, supersets for gen and subsets for kill, are both in the safe direction. Intuitively, increasing gen adds to the set of definitions that can reach a point, and cannot prevent a definition from reaching a place that it truly reached. Decreasing kill can only increase the set of definitions reaching any given point.
Computation of in and out:
Many data-flow problems can be solved by synthesized translations similar to those used
to compute gen and kill. It can be used, for example, to determine loop-invariant
computations.
However, there are other kinds of data-flow information, such as the reaching-definitions problem. It turns out that in is an inherited attribute, and out is a synthesized attribute depending on in. We intend that in[S] be the set of definitions reaching the beginning of S, taking into account the flow of control throughout the entire program, including statements outside of S or within which S is nested.
The set out[S] is defined similarly for the end of S. It is important to note the distinction between out[S] and gen[S]. The latter is the set of definitions that reach the end of S without following paths outside S.
Assuming we know in[S], we compute out by the equation:
out[S] = gen[S] U (in[S] - kill[S])
Consider the cascade of two statements S1 ; S2, as in the second case. We start by observing in[S1] = in[S]. Then, we recursively compute out[S1], which gives us in[S2], since a definition reaches the beginning of S2 if and only if it reaches the end of S1. Now we can compute out[S2], and this set is equal to out[S].
Considering an if-statement, we have conservatively assumed that control can follow either branch, so a definition reaches the beginning of S1 or S2 exactly when it reaches the beginning of S.
in[S1] = in[S2] = in[S]
A definition reaches the end of S if and only if it reaches the end of one or both sub-statements; i.e.,
out[S] = out[S1] U out[S2]
Representation of sets:
Sets of definitions, such as gen[S] and kill[S], can be represented compactly using bit vectors. We assign a number to each definition of interest in the flow graph. Then the bit vector representing a set of definitions will have 1 in position i if and only if the definition numbered i is in the set.
The number of a definition statement can be taken as the index of the statement in an array holding pointers to statements. However, not all definitions may be of interest during global data-flow analysis. Therefore the numbers of definitions of interest will typically be recorded in a separate table.
A bit-vector representation for sets also allows set operations to be implemented efficiently. The union and intersection of two sets can be implemented by logical or and logical and, respectively, basic operations in most systems-oriented programming languages. The difference A - B of sets A and B can be implemented by taking the complement of B and then using logical and to compute A - B.
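A minimal sketch of this representation uses Python integers as bit vectors: bit i is 1 iff definition i is in the set, union is |, intersection is &, and difference A - B is A & ~B. The particular definition numbers below are made up for illustration.

```python
def to_bits(defs):
    """Encode a set of definition numbers as an integer bit vector."""
    v = 0
    for d in defs:
        v |= 1 << d
    return v

# out[S] = gen[S] U (in[S] - kill[S]), entirely in bit operations
gen, kill, in_ = to_bits({3}), to_bits({1, 2}), to_bits({1, 4})
out = gen | (in_ & ~kill)
```

Here definition 1 is killed, definition 4 passes through, and definition 3 is generated, so out represents {3, 4}.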
Local reaching definitions:
Space for data-flow information can be traded for time, by saving information only at
certain points and, as needed, recomputing information at intervening points. Basic
blocks are usually treated as a unit during global flow analysis, with attention restricted to
only those points that are the beginnings of blocks.
Since there are usually many more points than blocks, restricting our effort to blocks is a
significant savings. When needed, the reaching definitions for all points in a block can be
calculated from the reaching definitions for the beginning of a block.
Use-definition chains:
It is often convenient to store the reaching-definition information as "use-definition chains" or ud-chains, which are lists, for each use of a variable, of all the definitions that reach that use. If a use of variable a in block B is preceded by no unambiguous definition of a, then the ud-chain for that use of a is the set of definitions in in[B] that are definitions of a. In addition, if there are ambiguous definitions of a, then all of these for which no unambiguous definition of a lies between it and the use of a are on the ud-chain for this use of a.
Evaluation order:
The techniques for conserving space during attribute evaluation, also apply to the
computation of data-flow information using specifications. Specifically, the only
constraint on the evaluation order for the gen, kill, in and out sets for statements is that
imposed by dependencies between these sets. Having chosen an evaluation order, we are
free to release the space for a set after all uses of it have occurred.
Earlier circular dependencies between attributes were not allowed, but we have seen that
data-flow equations may have circular dependencies.
General control flow:
Data-flow analysis must take all control paths into account. If the control paths are
evident from the syntax, then data-flow equations can be set up and solved in a syntax-
directed manner.
When programs can contain goto statements or even the more disciplined break and
continue statements, the approach we have taken must be modified to take the actual
control paths into account.
Several approaches may be taken. The iterative method works for arbitrary flow graphs. Since the flow graphs obtained in the presence of break and continue statements are reducible, such constructs can be handled systematically using the interval-based methods.
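The iterative method for reaching definitions can be sketched as follows: in[B] is the union of out[P] over all predecessors P of B, out[B] = gen[B] U (in[B] - kill[B]), and the two are recomputed until nothing changes. The block and definition names in the usage are illustrative only.

```python
def reaching_definitions(blocks, succ, gen, kill):
    """Iterative data-flow solution over an arbitrary flow graph."""
    preds = {b: set() for b in blocks}
    for b, ss in succ.items():
        for s in ss:
            preds[s].add(b)
    in_ = {b: set() for b in blocks}
    out = {b: set() for b in blocks}
    changed = True
    while changed:                      # repeat until a fixed point
        changed = False
        for b in blocks:
            in_[b] = (set().union(*(out[p] for p in preds[b]))
                      if preds[b] else set())
            new_out = gen[b] | (in_[b] - kill[b])
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return in_, out
```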
CS6660 COMPILER DESIGN UNIT V
However, the syntax-directed approach need not be abandoned when break and continue
statements are allowed.
CODE GENERATION
The final phase in compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The
code generation techniques presented below can be used whether or not an optimizing phase
occurs before code generation.
Position of code generator:
[Figure: source program → front end → intermediate code → code optimizer → intermediate code → code generator → target program; all phases consult the symbol table]
ISSUES IN THE DESIGN OF A CODE GENERATOR
The following issues arise during the code generation phase :
1. Input to code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to code generator:
The input to the code generation consists of the intermediate representation of the source
program produced by front end , together with information in the symbol table to
determine run-time addresses of the data objects denoted by the names in the
intermediate representation.
Intermediate representation can be :
a. Linear representation such as postfix notation
b. Three address representation such as quadruples
c. Virtual machine representation such as stack machine code
d. Graphical representations such as syntax trees and dags.
Prior to code generation, the source program has been scanned, parsed and translated into intermediate representation, along with the necessary type checking. Therefore, the input to code generation is assumed to be error-free.
2. Target program:
The output of the code generator is the target program. The output may be :
a. Absolute machine language
- It can be placed in a fixed memory location and can be executed immediately.
b. Relocatable machine language
- It allows subprograms to be compiled separately.
c. Assembly language
- Code generation is made easier.
3. Memory management:
Names in the source program are mapped to addresses of data objects in run-time
memory by the front end and code generator.
It makes use of symbol table, that is, a name in a three-address statement refers to a
symbol-table entry for the name.
Labels in three-address statements have to be converted to addresses of instructions.
For example,
j : goto i generates jump instruction as follows :
if i < j, a backward jump instruction with target address equal to location of
code for quadruple i is generated.
if i > j, the jump is forward. We must store on a list for quadruple i the location of the first machine instruction generated for quadruple j. When i is processed, the machine locations for all instructions that jump forward to i are filled in.
4. Instruction selection:
The instructions of target machine should be complete and uniform.
Instruction speeds and machine idioms are important factors when efficiency of target
program is considered.
The quality of the generated code is determined by its speed and size.
For example, the three-address statement a := a + 1 can be translated into the sequence MOV a,R0; ADD #1,R0; MOV R0,a. If the target machine has an increment instruction (INC), however, the same statement can be implemented by the single, cheaper instruction INC a.
5. Register allocation
Instructions involving register operands are shorter and faster than those involving
operands in memory.
The use of registers is subdivided into two subproblems:
Register allocation - the set of variables that will reside in registers at a point in the program is selected.
Register assignment - the specific register that a variable will reside in is picked.
Certain machines require even-odd register pairs for some operands and results.
For example, consider a division instruction of the form:
D x, y
where x is the dividend, held in the even register of an even/odd register pair, and y is the divisor. After the division, the even register holds the remainder and the odd register holds the quotient.
6. Evaluation order
The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate
results than others.
TARGET MACHINE
Familiarity with the target machine and its instruction set is a prerequisite for designing a
good code generator.
The target computer is a byte-addressable machine with 4 bytes to a word.
It has n general-purpose registers, R0, R1, . . . , Rn-1.
It has two-address instructions of the form:
op source, destination
where op is an op-code, and source and destination are data fields.
It has the following op-codes :
MOV (move source to destination)
ADD (add source to destination)
SUB (subtract source from destination)
The source and destination of an instruction are specified by combining registers and
memory locations with address modes.
Address modes with their assembly-language forms:

MODE                FORM     ADDRESS                      ADDED COST
absolute            M        M                            1
register            R        R                            0
indexed             c(R)     c + contents(R)              1
indirect register   *R       contents(R)                  0
indirect indexed    *c(R)    contents(c + contents(R))    1
literal             #c       c                            1
For example: MOV R0, M stores the contents of register R0 into memory location M;
MOV 4(R0), M stores the value contents(4 + contents(R0)) into M.
Instruction costs :
Instruction cost = 1+cost for source and destination address modes. This cost corresponds
to the length of the instruction.
Address modes involving registers have cost zero.
Address modes involving memory location or literal have cost one.
Instruction length should be minimized if space is important. Doing so also tends to minimize the time taken to fetch and execute the instruction.
For example : MOV R0, R1 copies the contents of register R0 into R1. It has cost one,
since it occupies only one word of memory.
The three-address statement a := b + c can be implemented by many different instruction sequences:

i)   MOV b, R0
     ADD c, R0        cost = 6
     MOV R0, a

ii)  MOV b, a
     ADD c, a         cost = 6

iii) Assuming R0, R1 and R2 contain the addresses of a, b, and c:
     MOV *R1, *R0
     ADD *R2, *R0     cost = 2
In order to generate good code for target machine, we must utilize its addressing
capabilities efficiently.
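The cost rules above can be sketched in code. This is an illustrative model only; the mode names and added costs mirror the address-mode table given earlier, and the function and dictionary names are invented for this sketch:

```python
# Sketch of the instruction-cost model described above:
# cost = 1 (one word for the instruction itself) plus the added cost of
# each operand's address mode, per the table given earlier.

MODE_COST = {
    "absolute": 1,           # M
    "register": 0,           # R
    "indexed": 1,            # c(R)
    "indirect_register": 0,  # *R
    "indirect_indexed": 1,   # *c(R)
    "literal": 1,            # #c
}

def instruction_cost(src_mode, dst_mode):
    """Length-based cost of a two-address instruction."""
    return 1 + MODE_COST[src_mode] + MODE_COST[dst_mode]

# Sequence (i): MOV b,R0 ; ADD c,R0 ; MOV R0,a  ->  2 + 2 + 2 = 6
seq_i = [("absolute", "register"), ("absolute", "register"),
         ("register", "absolute")]
print(sum(instruction_cost(s, d) for s, d in seq_i))    # 6

# Sequence (iii): MOV *R1,*R0 ; ADD *R2,*R0  ->  1 + 1 = 2
seq_iii = [("indirect_register", "indirect_register")] * 2
print(sum(instruction_cost(s, d) for s, d in seq_iii))  # 2
```

Sequence (iii) is cheapest under this model, but it computes the result into the location pointed to by R0 rather than into a named memory cell, which is why all three variants remain candidates.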
RUN-TIME STORAGE MANAGEMENT
Information needed during an execution of a procedure is kept in a block of storage
called an activation record, which includes storage for names local to the procedure.
The two standard storage allocation strategies are:
1. Static allocation
2. Stack allocation
In static allocation, the position of an activation record in memory is fixed at compile
time.
In stack allocation, a new activation record is pushed onto the stack for each execution of a procedure. The record is popped when the activation ends.
The following three-address statements are associated with the run-time allocation and
deallocation of activation records:
1. Call,
2. Return,
3. Halt, and
4. Action, a placeholder for other statements.
We assume that the run-time memory is divided into areas for:
1. Code
2. Static data
3. Stack
Static allocation
Implementation of call statement:
The codes needed to implement static allocation are as follows:
MOV #here + 20, callee.static_area /*It saves return address*/
GOTO callee.code_area /*It transfers control to the target code for the called procedure */
where,
callee.static_area - address of the activation record
callee.code_area - address of the first instruction of the called procedure
#here + 20 - literal return address, i.e., the address of the instruction following the GOTO
Implementation of return statement:
A return from procedure callee is implemented by :
GOTO *callee.static_area
This transfers control to the address saved at the beginning of the activation record.
Implementation of action statement:
The instruction ACTION is used to implement action statement.
Implementation of halt statement:
The statement HALT is the final instruction that returns control to the operating system.
Stack allocation
Static allocation can become stack allocation by using relative addresses for storage in
activation records. In stack allocation, the position of activation record is stored in register so
words in activation records can be accessed as offsets from the value in this register.
The codes needed to implement stack allocation are as follows:
Initialization of stack:
MOV #stackstart , SP /* initializes stack */
Code for the first procedure
HALT /* terminate execution */
Implementation of Call statement:
ADD #caller.recordsize, SP /* increment stack pointer */
MOV #here + 16, *SP /*Save return address */
GOTO callee.code_area
where,
caller.recordsize - size of the activation record
#here + 16 - address of the instruction following the GOTO
Implementation of Return statement:
GOTO *0 ( SP ) /*return to the caller */
SUB #caller.recordsize, SP /* decrement SP and restore to previous value */
BASIC BLOCKS AND FLOW GRAPHS
Basic Blocks
A basic block is a sequence of consecutive statements in which flow of control enters at
the beginning and leaves at the end without any halt or possibility of branching except at
the end.
The following sequence of three-address statements forms a basic block:
t1 := a * a
t2 := a * b
t3 := 2 * t2
t4 := t1 + t3
t5 := b * b
t6 := t4 + t5
Basic Block Construction:
Algorithm: Partition into basic blocks
Input: A sequence of three-address statements
Output: A list of basic blocks with each three-address statement in exactly one block
Method:
1. We first determine the set of leaders, the first statements of basic blocks. The rules
we use are the following:
a. The first statement is a leader.
b. Any statement that is the target of a conditional or unconditional goto is a
leader.
c. Any statement that immediately follows a goto or conditional goto statement
is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not
including the next leader or the end of the program.
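The leader-based partitioning method above can be sketched as follows. The string encoding of statements and the pattern used to spot jump targets are assumptions made for illustration:

```python
# Sketch of the leader-based basic-block partitioning algorithm above.
# Statements are strings; "goto (N)" marks a jump to statement N
# (a simplified three-address form assumed for illustration).
import re

def find_leaders(stmts):
    leaders = {0}                                  # rule (a): first statement
    for i, s in enumerate(stmts):
        m = re.search(r"goto \((\d+)\)", s)
        if m:
            leaders.add(int(m.group(1)) - 1)       # rule (b): jump target
            if i + 1 < len(stmts):
                leaders.add(i + 1)                 # rule (c): statement after a jump
    return sorted(leaders)

def basic_blocks(stmts):
    # each block runs from one leader up to (not including) the next
    bounds = find_leaders(stmts) + [len(stmts)]
    return [stmts[bounds[k]:bounds[k + 1]] for k in range(len(bounds) - 1)]

code = [
    "prod := 0", "i := 1",
    "t1 := 4*i", "t2 := a[t1]", "t3 := 4*i", "t4 := b[t3]",
    "t5 := t2*t4", "t6 := prod+t5", "prod := t6",
    "t7 := i+1", "i := t7", "if i<=20 goto (3)",
]
print(len(basic_blocks(code)))   # 2 -- statements (1)-(2) and (3)-(12)
```

Run on the dot-product code from the next example, this reproduces the two blocks identified in the text.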
Consider the following source code for dot product of two vectors a and b of length 20
begin
prod :=0;
i:=1;
do begin
prod :=prod+ a[i] * b[i];
i :=i+1;
end
while i <= 20
end
The three-address code for the above source program is given as :
(1) prod := 0
(2) i := 1
(3) t1 := 4 * i
(4) t2 := a[t1]   /* compute a[i] */
(5) t3 := 4 * i
(6) t4 := b[t3]   /* compute b[i] */
(7) t5 := t2 * t4
(8) t6 := prod + t5
(9) prod := t6
(10) t7 := i + 1
(11) i := t7
(12) if i <= 20 goto (3)
Basic block 1: Statement (1) to (2)
Basic block 2: Statement (3) to (12)
Transformations on Basic Blocks:
A number of transformations can be applied to a basic block without changing the set of
expressions computed by the block. Two important classes of transformation are :
Structure-preserving transformations
Algebraic transformations
1. Structure preserving transformations:
a) Common subexpression elimination:
a := b + c        a := b + c
b := a - d        b := a - d
c := b + c        c := b + c
d := a - d        d := b
Since the second and fourth statements compute the same expression (a - d), the basic block can be transformed as above.
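Common-subexpression elimination within a block can be sketched with a simple table of available expressions. The (x, op, y, z) tuple encoding and the "copy" pseudo-op are assumptions made for illustration; real compilers typically do this with a DAG or value numbering:

```python
# Sketch: local common-subexpression elimination inside one basic block.
# Statements are (x, op, y, z) tuples meaning x := y op z.

def eliminate_cse(block):
    available = {}                 # (op, y, z) -> variable holding that value
    out = []
    for x, op, y, z in block:
        key = (op, y, z)
        if key in available:
            out.append((x, "copy", available[key], None))   # reuse earlier value
        else:
            out.append((x, op, y, z))
        # assigning x kills any expression that reads x or whose value is in x
        available = {k: v for k, v in available.items()
                     if x not in (k[1], k[2], v)}
        if x not in (y, z):        # x now holds the value of (op, y, z)
            available[key] = x
    return out

block = [("a", "+", "b", "c"),
         ("b", "-", "a", "d"),
         ("c", "+", "b", "c"),
         ("d", "-", "a", "d")]
print(eliminate_cse(block)[-1])    # ('d', 'copy', 'b', None)
```

On the example block above this leaves the third statement alone (its b has been redefined, so b + c is no longer the same expression) but rewrites the fourth into d := b, matching the transformed block in the text.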
b) Dead-code elimination:
Suppose x is dead, that is, never subsequently used, at the point where the statement x := y + z appears in a basic block. Then this statement may be safely removed without changing the value of the basic block.
c) Renaming temporary variables:
A statement t := b + c (t is a temporary) can be changed to u := b + c (u is a new temporary), and all uses of this instance of t can be changed to u without changing the value of the basic block. A block in which each statement that defines a temporary defines a new temporary is called a normal-form block.
d) Interchange of statements:
Suppose a block has the following two adjacent statements:
t1 : = b + c
t2 : = x + y
We can interchange the two statements without affecting the value of the block if and only if neither x nor y is t1 and neither b nor c is t2.
2. Algebraic transformations:
Algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set.
Examples:
i) x : = x + 0 or x : = x * 1 can be eliminated from a basic block without changing the set of
expressions it computes.
ii) The exponential statement x : = y * * 2 can be replaced by x : = y * y.
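These algebraic identities can be applied as a small peephole-style pass. The tuple encoding, the "copy" pseudo-op, and the "**" spelling of exponentiation are assumptions for illustration:

```python
# Sketch: apply the algebraic identities above to (x, op, y, z) statements:
# x := y + 0 and x := y * 1 become copies (and vanish when x == y);
# x := y ** 2 is strength-reduced to x := y * y.

def algebraic_simplify(stmts):
    out = []
    for x, op, y, z in stmts:
        if (op == "+" and z == "0") or (op == "*" and z == "1"):
            if x == y:
                continue                     # x := x + 0 / x := x * 1 removed
            out.append((x, "copy", y, None))
        elif op == "**" and z == "2":
            out.append((x, "*", y, y))       # replace exponentiation by multiply
        else:
            out.append((x, op, y, z))
    return out

print(algebraic_simplify([("x", "+", "x", "0"), ("x", "**", "y", "2")]))
# [('x', '*', 'y', 'y')]
```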
Flow Graphs
Flow graph is a directed graph containing the flow-of-control information for the set of
basic blocks making up a program.
The nodes of the flow graph are basic blocks. It has a distinguished initial node.
E.g.: Flow graph for the vector dot product is given as follows:
B1:
prod := 0
i := 1

B2:
t1 := 4 * i
t2 := a[t1]
t3 := 4 * i
t4 := b[t3]
t5 := t2 * t4
t6 := prod + t5
prod := t6
t7 := i + 1
i := t7
if i <= 20 goto B2
B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The target of the jump from the last statement of B2 is the first statement of B2, so there is an edge from B2 (last statement) to B2 (first statement).
B1 is the predecessor of B2, and B2 is a successor of B1.
Loops
A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
A loop that contains no other loops is called an inner loop.
NEXT-USE INFORMATION
If the name in a register is no longer needed, then we remove the name from the register
and the register can be used to store some other names.
Input: Basic block B of three-address statements
Output: At each statement i: x := y op z, we attach to i the liveness and next-use information of x, y and z.
Method: We start at the last statement of B and scan backwards.
1. Attach to statement i the information currently found in the symbol table regarding the next use and liveness of x, y and z.
2. In the symbol table, set x to "not live" and "no next use".
3. In the symbol table, set y and z to "live", and the next uses of y and z to i.
Symbol Table:

Name    Liveness    Next-use
x       not live    no next-use
y       live        i
z       live        i
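The backward scan described above can be sketched directly. The (x, y, z) triple encoding (the operator is irrelevant here) and the treatment of names live on block exit are assumptions for illustration:

```python
# Sketch of the backward next-use scan: for each statement i: x = y op z
# we record the liveness/next-use of x, y, z *before* updating the table.
# Names live on exit from the block start out live with no next use.

def next_use(block, live_on_exit=()):
    table = {n: ("live", None) for n in live_on_exit}
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        x, y, z = block[i]
        # step 1: attach current symbol-table information to statement i
        info[i] = {n: table.get(n, ("not live", None)) for n in (x, y, z)}
        table[x] = ("not live", None)   # step 2: the defined name is killed
        table[y] = ("live", i)          # step 3: operands are live,
        table[z] = ("live", i)          #         with next use at statement i
    return info

block = [("t1", "a", "b"),    # t1 := a op b
         ("t2", "t1", "c")]   # t2 := t1 op c
info = next_use(block, live_on_exit={"t2"})
print(info[0]["t1"])   # ('live', 1) -- t1's next use is statement 1
```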
A SIMPLE CODE GENERATOR
A code generator generates target code for a sequence of three- address statements and
effectively uses registers to store operands of the statements.
For example: consider the three-address statement a := b+c
It can have the following sequence of codes:
ADD Rj, Ri        Cost = 1   // if Ri contains b and Rj contains c
(or)
ADD c, Ri         Cost = 2   // if c is in a memory location
(or)
MOV c, Rj         Cost = 3   // move c from memory to Rj and add
ADD Rj, Ri
Register and Address Descriptors:
A register descriptor keeps track of what is currently in each register. The register descriptors show that initially all the registers are empty.
An address descriptor stores the location where the current value of the name can be
found at run time.
A code-generation algorithm:
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x : = y op z, perform the following actions:
1. Invoke a function getreg to determine the location L where the result of the computation y op z should be stored.
2. Consult the address descriptor for y to determine y', the current location of y. Prefer the register for y if the value of y is currently both in memory and a register. If the value of y is not already in L, generate the instruction MOV y', L to place a copy of y in L.
3. Generate the instruction OP z', L where z' is a current location of z. Prefer a register to a memory location if z is in both. Update the address descriptor of x to indicate that x is in location L. If L is a register, update its descriptor to indicate that it holds the value of x, and remove x from all other register descriptors.
4. If the current values of y or z have no next uses, are not live on exit from the block, and are in
registers, alter the register descriptor to indicate that, after execution of x : = y op z , those
registers will no longer contain y or z.
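The algorithm above can be sketched with explicit register and address descriptors. This is a deliberately naive version: its getreg reuses y's register when y is already in one (safe here because each value's last use is as the left operand) and otherwise grabs any free register; spilling is not handled, so the example must fit in the available registers:

```python
# Sketch of the simple code generator with register/address descriptors.
# Statements are (x, op, y, z) tuples meaning x := y op z.

def gen_block(block, live_on_exit=(), registers=("R0", "R1")):
    reg_desc = {r: None for r in registers}  # register -> name it holds
    addr_desc = {}                           # name -> register holding it
    code = []

    def getreg(y):
        if addr_desc.get(y) in registers:    # prefer y's own register
            return addr_desc[y]
        for r in registers:
            if reg_desc[r] is None:
                return r
        raise RuntimeError("no free register (spilling not sketched)")

    for x, op, y, z in block:
        L = getreg(y)
        if addr_desc.get(y) != L:            # y not yet in L: load it
            code.append(f"MOV {y}, {L}")
        code.append(f"{op} {addr_desc.get(z, z)}, {L}")
        if reg_desc[L] is not None:          # L's old value is overwritten
            addr_desc.pop(reg_desc[L], None)
        reg_desc[L] = x                      # update both descriptors for x
        addr_desc[x] = L
    for name in live_on_exit:                # store live names back to memory
        if addr_desc.get(name) in registers:
            code.append(f"MOV {addr_desc[name]}, {name}")
    return code

block = [("t", "SUB", "a", "b"), ("u", "SUB", "a", "c"),
         ("v", "ADD", "t", "u"), ("d", "ADD", "v", "u")]
for line in gen_block(block, live_on_exit=("d",)):
    print(line)
```

Run on the d := (a - b) + (a - c) + (a - c) example that follows, this emits MOV a, R0 / SUB b, R0 / MOV a, R1 / SUB c, R1 / ADD R1, R0 / ADD R1, R0 / MOV R0, d, matching the code sequence in the table below.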
Generating Code for Assignment Statements:
The assignment d := (a - b) + (a - c) + (a - c) might be translated into the following three-address code sequence:
t := a - b
u := a - c
v := t + u
d := v + u
with d live at the end.
Code sequence for the example is:

Statements    Code Generated    Register descriptor      Address descriptor
                                registers empty
t := a - b    MOV a, R0         R0 contains t            t in R0
              SUB b, R0
u := a - c    MOV a, R1         R0 contains t            t in R0
              SUB c, R1         R1 contains u            u in R1
v := t + u    ADD R1, R0        R0 contains v            u in R1
                                R1 contains u            v in R0
d := v + u    ADD R1, R0        R0 contains d            d in R0
              MOV R0, d                                  d in R0 and memory
Generating Code for Indexed Assignments
The table shows the code sequences generated for the indexed assignment statements
a : = b [ i ] and a [ i ] : = b
Statements    Code Generated    Cost
a := b[i]     MOV b(Ri), R      2
a[i] := b     MOV b, a(Ri)      3
Generating Code for Pointer Assignments
The table shows the code sequences generated for the pointer assignments
a : = *p and *p : = a
Statements    Code Generated    Cost
a := *p       MOV *Rp, a        2
*p := a       MOV a, *Rp        2
Generating Code for Conditional Statements
Statement            Code
if x < y goto z      CMP x, y
                     CJ< z   /* jump to z if condition code is negative */

x := y + z           MOV y, R0
if x < 0 goto z      ADD z, R0
                     MOV R0, x
                     CJ< z
THE DAG REPRESENTATION FOR BASIC BLOCKS
A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels to store the
computed values.
DAGs are useful data structures for implementing transformations on basic blocks.
A DAG gives a picture of how the value computed by a statement is used in subsequent statements, and it provides a good way of determining common subexpressions.
Algorithm for construction of DAG
Input: A basic block
Output: A DAG for the basic block containing the following information:
1. A label for each node. For leaves, the label is an identifier. For interior nodes, an
operator symbol.
2. For each node a list of attached identifiers to hold the computed values.
Case (i) x : = y OP z
Case (ii) x : = OP y
Case (iii) x : = y
Method:
Step 1: If y is undefined then create node(y).
If z is undefined, create node(z) for case(i).
Step 2: For case (i), determine whether there is a node(OP) whose left child is node(y) and right child is node(z) (this checks for a common subexpression). If not, create such a node. Let n be this node.
For case (ii), determine whether there is a node(OP) with one child node(y). If not, create such a node. Let n be this node.
For case (iii), node n will be node(y).
Step 3: Delete x from the list of identifiers for node(x). Append x to the list of attached
identifiers for the node n found in step 2 and set node(x) to n.
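The three steps above can be sketched for case (i) statements. Writing array indexing a[t1] as the binary operator "[]" is an encoding assumed for illustration:

```python
# Sketch of DAG construction for case (i) statements x := y op z.

class Node:
    def __init__(self, label, children=()):
        self.label = label        # identifier for a leaf, operator otherwise
        self.children = children
        self.ids = []             # list of attached identifiers

def build_dag(block):
    node_of = {}                  # name -> node currently holding its value
    interior = {}                 # (op, left node, right node) -> node
    def node(name):               # step 1: create a leaf if undefined
        if name not in node_of:
            node_of[name] = Node(name)
        return node_of[name]
    for x, op, y, z in block:
        l, r = node(y), node(z)
        key = (op, id(l), id(r))
        n = interior.get(key)     # step 2: check for a common subexpression
        if n is None:
            n = Node(op, (l, r))
            interior[key] = n
        old = node_of.get(x)      # step 3: move x's identifier onto n
        if old is not None and x in old.ids:
            old.ids.remove(x)
        n.ids.append(x)
        node_of[x] = n
    return node_of

block = [("t1", "*", "4", "i"), ("t2", "[]", "a", "t1"),
         ("t3", "*", "4", "i"), ("t4", "[]", "b", "t3")]
dag = build_dag(block)
print(dag["t1"] is dag["t3"])   # True: 4 * i is represented by one node
print(dag["t1"].ids)            # ['t1', 't3']
```

On the first four statements of the example block below, t1 and t3 end up attached to the same * node, which is exactly how the DAG exposes the common subexpression 4 * i.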
Example: Consider the block of three- address statements:
1. t1 := 4 * i
2. t2 := a[t1]
3. t3 := 4 * i
4. t4 := b[t3]
5. t5 := t2 * t4
6. t6 := prod + t5
7. prod := t6
8. t7 := i + 1
9. i := t7
10. if i <= 20 goto (1)
Stages in DAG Construction
Application of DAGs:
1. We can automatically detect common sub expressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.
GENERATING CODE FROM DAGs
The advantage of generating code for a basic block from its DAG representation is that, from a DAG, we can more easily see how to rearrange the order of the final computation sequence than we can starting from a linear sequence of three-address statements or quadruples.
Rearranging the order
The order in which computations are done can affect the cost of resulting object code.
For example, consider the following basic block:
t1 := a + b
t2 := c + d
t3 := e - t2
t4 := t1 - t3
Generated code sequence for the basic block:

MOV a, R0
ADD b, R0
MOV c, R1
ADD d, R1
MOV R0, t1
MOV e, R0
SUB R1, R0
MOV t1, R1
SUB R0, R1
MOV R1, t4
Rearranged basic block:
Now t1 occurs immediately before t4.
t2 := c + d
t3 := e - t2
t1 := a + b
t4 := t1 - t3
Revised code sequence:
MOV c, R0
ADD d, R0
MOV e, R1
SUB R0, R1
MOV a, R0
ADD b, R0
SUB R1, R0
MOV R0, t4
In this order, the two instructions MOV R0, t1 and MOV t1, R1 have been saved.
A Heuristic ordering for Dags
The heuristic ordering algorithm attempts to make the evaluation of a node immediately follow
the evaluation of its leftmost argument.
The algorithm shown below produces the ordering in reverse.
Algorithm:
1) while unlisted interior nodes remain do begin
2) select an unlisted node n, all of whose parents have been listed;
3) list n;
4) while the leftmost child m of n has no unlisted parents and is not a leaf do
begin
5) list m;
6) n : = m
end
end
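The six numbered lines above can be sketched as follows. Encoding the DAG as a map from interior node to child list (anything not in the map is a leaf) is an assumption made for illustration:

```python
# Sketch of the heuristic node-listing algorithm above.

def heuristic_order(children):
    parents = {}
    for n, ch in children.items():
        for c in ch:
            parents.setdefault(c, []).append(n)
    interior = set(children)
    listed = []
    def ready(n):                 # unlisted, with all parents listed
        return n not in listed and all(p in listed for p in parents.get(n, []))
    while len(listed) < len(interior):                      # line (1)
        n = next(x for x in sorted(interior) if ready(x))   # line (2)
        listed.append(n)                                    # line (3)
        while True:                                         # line (4)
            m = children[n][0]    # leftmost child of n
            if m not in interior or not ready(m):
                break
            listed.append(m)                                # line (5)
            n = m                                           # line (6)
    return listed   # the nodes are evaluated in the *reverse* of this order

# The DAG from the example (numbers are interior nodes, letters leaves):
dag = {1: [2, 3], 2: [6, 4], 3: [4, "e"], 4: [5, 8],
       5: [6, "c"], 6: ["a", "b"], 8: ["d", "e"]}
print(heuristic_order(dag))   # [1, 2, 3, 4, 5, 6, 8]
```

On the example DAG this produces the list 1 2 3 4 5 6 8, so the evaluation order is 8 6 5 4 3 2 1, matching the walkthrough below.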
Example: Consider the DAG whose interior nodes are numbered 1 to 6 and 8, with leaves a, b, c, d, e (the picture is reproduced here as a child list, consistent with the code sequence below):

node 1: *  (children: node 2, node 3)
node 2: +  (children: node 6, node 4)
node 3: -  (children: node 4, leaf e)
node 4: *  (children: node 5, node 8)
node 5: -  (children: node 6, leaf c)
node 6: +  (children: leaf a, leaf b)
node 8: +  (children: leaf d, leaf e)
Initially, the only node with no unlisted parents is 1 so set n=1 at line (2) and list 1 at line (3).
Now, the left argument of 1, which is 2, has its parents listed, so we list 2 and set n=2 at line (6).
Now, at line (4) we find the leftmost child of 2, which is 6, has an unlisted parent 5. Thus we
select a new n at line (2), and node 3 is the only candidate. We list 3 and proceed down its left
chain, listing 4, 5 and 6. This leaves only 8 among the interior nodes so we list that.
The resulting list is 1234568 and the order of evaluation is 8654321.
Code sequence:

t8 := d + e
t6 := a + b
t5 := t6 - c
t4 := t5 * t8
t3 := t4 - e
t2 := t6 + t4
t1 := t2 * t3

This will yield optimal code for the DAG on the machine, whatever the number of registers.